OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
David Sterba	8bfc9b2cf4	btrfs: use enum for btrfs_block_rsv::type The number of block group reserve types BTRFS_BLOCK_RSV_* is small and fits to u8 and there's enough left in case we want to add more. For type safety use the enum but make it 8 bits in the structure to save space. The structure size is now 48 on release build, making a slight improvement in structures where it's embedded, like btrfs_fs_info or btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
David Sterba	710d5921d1	btrfs: switch btrfs_block_rsv::failfast to bool Use simple bool type for the block reserve failfast status, there's short to save space as there used to be int but there's no reason for that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
David Sterba	c70c2c5bc9	btrfs: switch btrfs_block_rsv::full to bool Use simple bool type for the block reserve full status, there's short to save space as there used to be int but there's no reason for that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	37899117e8	btrfs: do not return errors from btrfs_submit_dio_bio Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission and the other btrfs bio submission handlers do and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	ea1f0cedef	btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully btrfs_wq_submit_bio is used for writeback under memory pressure. Instead of failing the I/O when we can't allocate the async_submit_bio, just punt back to the synchronous submission path. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	82443fd55c	btrfs: simplify sync/async submission in btrfs_submit_data_write_bio btrfs_submit_data_write_bio special cases the reloc root because the checksums are preloaded, but only does so for the !sync case. The sync case can't happen for data relocation, but just handling it more generally significantly simplifies the logic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	b9af128d1e	btrfs: raid56: transfer the bio counter reference to the raid submission helpers Transfer the bio counter reference acquired by btrfs_submit_bio to raid56_parity_write and raid56_parity_recovery together with the bio that the reference was acquired for instead of acquiring another reference in those helpers and dropping the original one in btrfs_submit_bio. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	6065fd95da	btrfs: do not return errors from raid56_parity_recover Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Also use the proper bool type for the generic_io argument. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Christoph Hellwig	31683f4aae	btrfs: do not return errors from raid56_parity_write Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Christoph Hellwig	1a722d8f5b	btrfs: do not return errors from btrfs_map_bio Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Qu Wenruo	462b0b2a86	btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() For profiles other than RAID56, __btrfs_map_block() returns @map_length as min(stripe_end, logical + *length), which is also the same result from btrfs_get_io_geometry(). But for RAID56, __btrfs_map_block() returns @map_length as stripe_len. This strange behavior is going to hurt incoming bio split at btrfs_map_bio() time, as we will use @map_length as bio split size. Fix this behavior by returning @map_length by the same calculation as for other profiles. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Christoph Hellwig	ff18a4afeb	btrfs: raid56: use fixed stripe length everywhere The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there are functions passing it as arguments, this is not necessary. The fixed value has been used for a long time and though the stripe length should be configurable by super block member stripesize, this hasn't been implemented and would require more changes so we don't need to keep this code around until then. Partially based on a patch from Qu Wenruo. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Filipe Manana	0201fceb9f	btrfs: remove the inode cache check at btrfs_is_free_space_inode() The inode cache feature was removed in kernel 5.11, and we no longer have any code that reads from or writes to inode caches. We may still mount a filesystem that has inode caches, but they are ignored. Remove the check for an inode cache from btrfs_is_free_space_inode(), since we no longer have code to trigger reads from an inode cache or writes to an inode cache. The check at send.c is still needed, because in case we find a filesystem with an inode cache, we must ignore it. Also leave the checks at tree-checker.c, as they are sanity checks. This eliminates a dead branch and reduces the amount of code since it's in an inline function. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620662 189240 29032 1838934 1c0f56 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	74860816e8	btrfs: sysfs: remove BIG_METADATA feature files This flag has been merged in 3.10 and is effectively always-on. Its status depends on the host page size so there's another way to guarantee compatibility with old kernels. Due to a bug introduced in `6f93e834fa` ("btrfs: fix upper limit for max_inline for page size 64K") the flag is not persisted among features in the superblock so it's not reliable. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	0766837b0d	btrfs: sysfs: remove MIXED_BACKREF feature file This feature has been the default for about 13 year. At this point it's safe to consider it an indispensable feature of BTRFS as such there's no need to advertise it in sysfs. Remove the global sysfs feature file, the per-filesystem feature file has never been there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	49f468c938	btrfs: don't print 'has skinny extents' anymore on mount Skinny extents have been a default mkfs feature since version 3.18 i (introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs: make skinny-metadata default") ). It really doesn't bring any value to users to simply remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	6b769dac21	btrfs: don't print 'flagging with big metadata' anymore on mount Added in commit `727011e07c` ("Btrfs: allow metadata blocks larger than the page size") in 2010 and it's been default for mkfs since 3.12 (2013). The message doesn't really convey any useful information to users. Remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
David Sterba	c1867eb33e	btrfs: clean up chained assignments The chained assignments may be convenient to write, but make readability a bit worse as it's too easy to overlook that there are several values set on the same line while this is rather an exception. Making it consistent everywhere avoids surprises. The pattern where inode times are initialized reuses the first value and the order is mtime, ctime. In other blocks the assignments are expanded so the order of variables is similar to the neighboring code. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
David Sterba	ac0677348f	btrfs: merge calculations for simple striped profiles in btrfs_rmap_block Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1) and RAID10 (map->sub_stripes is 2), with equivalent results. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	d09cb9e188	btrfs: use mask for all RAID1* profiles in btrfs_calc_avail_data_space There's a sequence of hard coded values for RAID1 profiles that are already stored in the raid_attr table that should be used instead. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Nikolay Borisov	e26b04c4c9	btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA Commit `6f93e834fa` seemingly inadvertently moved the code responsible for flagging the filesystem as having BIG_METADATA to a place where setting the flag was essentially lost. This means that filesystems created with kernels containing this bug (starting with 5.15) can potentially be mounted by older (pre-3.4) kernels. In reality chances for this happening are low because there are other incompat flags introduced in the mean time. Still the correct behavior is to set INCOMPAT_BIG_METADATA flag and persist this in the superblock. Fixes: `6f93e834fa` ("btrfs: fix upper limit for max_inline for page size 64K") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	c8a5f8ca9a	btrfs: print checksum type and implementation at mount time Per user request, print the checksum type and implementation at mount time among the messages. The checksum is user configurable and the actual crypto implementation is useful to see for performance reasons. The same information is also available after mount in /sys/fs/FSID/checksum file. Example: [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm Link: https://github.com/kdave/btrfs-progs/issues/483 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Josef Bacik	1314ca78b2	btrfs: reset block group chunk force if we have to wait If you try to force a chunk allocation, but you race with another chunk allocation, you will end up waiting on the chunk allocation that just occurred and then allocate another chunk. If you have many threads all doing this at once you can way over-allocate chunks. Fix this by resetting force to NO_FORCE, that way if we think we need to allocate we can, otherwise we don't force another chunk allocation if one is already happening. Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	4824735918	btrfs: send: add new command FILEATTR for file attributes There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS and later from XFLAGS interfaces, now commonly found under the 'fileattr' API. This corresponds to the individual inode bits and that's part of the on-disk format, so this is suitable for the protocol. The other interfaces contain a lot of cruft or bits that btrfs does not support yet. Currently the value is u64 and matches btrfs_inode_item. Not all the bits can be set by ioctls (like NODATASUM or READONLY), but we can send them over the protocol and leave it up to the receiving side what and how to apply. As some of the flags, eg. IMMUTABLE, can prevent any further changes, the receiving side needs to understand that and apply the changes in the right order, or possibly with some intermediate steps. This should be easier, future proof and simpler on the protocol layer than implementing in kernel. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	22a5b2abb7	btrfs: send: add OTIME as utimes attribute for proto 2+ by default When send v1 was introduced the otime (inode creation time) was not available, however the attribute in btrfs send protocol exists. Though it would be possible to add it for v1 too as the attribute would be ignored by v1 receive, let's not change the layout of v1 and only add that to v2+. The otime cannot be changed and is only informative. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Qu Wenruo	8f0ed7d4e7	btrfs: output mirror number for bad metadata When handling a real world transid mismatch image, it's hard to know which copy is corrupted, as the error messages just look like this: BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 We don't even know if the retry is caused by btrfs or the VFS retry. To make things a little easier to read, add mirror number for all related tree block read errors. So the above messages would look like this: BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 Signed-off-by: Qu Wenruo <wqu@suse.com> [ update messages, add "logical" ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	aaafa1ebd6	btrfs: replace unnecessary goto with direct return at cow_file_range() The 'goto out' in cow_file_range() in the exit block are not necessary and jump back. Replace them with return, while still keeping 'goto out' in the main code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ keep goto in the main code, update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	71aa147b4d	btrfs: fix error handling of fallback uncompress write When cow_file_range() fails in the middle of the allocation loop, it unlocks the pages but leaves the ordered extents intact. Thus, we need to call btrfs_cleanup_ordered_extents() to finish the created ordered extents. Also, we need to call end_extent_writepage() if locked_page is available because btrfs_cleanup_ordered_extents() never processes the region on the locked_page. Furthermore, we need to set the mapping as error if locked_page is unavailable before unlocking the pages, so that the errno is properly propagated to the user space. CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	99826e4cab	btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it is not usable for submit_uncompressed_range() which can have NULL locked_page. Add support supports locked_page == NULL case. Also, it rewrites redundant "page_offset(locked_page)". Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	9ce7466f37	btrfs: ensure pages are unlocked on cow_file_range() failure There is a hung_task report on zoned btrfs like below. https://github.com/naota/linux/issues/59 [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds. [726.329839] Not tainted 5.16.0-rc1+ #1 [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000 [726.331608] Call Trace: [726.331611] <TASK> [726.331614] __schedule+0x2e5/0x9d0 [726.331622] schedule+0x58/0xd0 [726.331626] io_schedule+0x3f/0x70 [726.331629] __folio_lock+0x125/0x200 [726.331634] ? find_get_entries+0x1bc/0x240 [726.331638] ? filemap_invalidate_unlock_two+0x40/0x40 [726.331642] truncate_inode_pages_range+0x5b2/0x770 [726.331649] truncate_inode_pages_final+0x44/0x50 [726.331653] btrfs_evict_inode+0x67/0x480 [726.331658] evict+0xd0/0x180 [726.331661] iput+0x13f/0x200 [726.331664] do_unlinkat+0x1c0/0x2b0 [726.331668] __x64_sys_unlink+0x23/0x30 [726.331670] do_syscall_64+0x3b/0xc0 [726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae [726.331677] RIP: 0033:0x7fb9490a171b [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300 [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000 [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000 [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260 [726.331693] </TASK> While we debug the issue, we found running fstests generic/551 on 5GB non-zoned null_blk device in the emulated zoned mode also had a similar hung issue. Also, we can reproduce the same symptom with an error injected cow_file_range() setup. The hang occurs when cow_file_range() fails in the middle of allocation. cow_file_range() called from do_allocation_zoned() can split the give region ([start, end]) for allocation depending on current block group usages. When btrfs can allocate bytes for one part of the split regions but fails for the other region (e.g. because of -ENOSPC), we return the error leaving the pages in the succeeded regions locked. Technically, this occurs only when @unlock == 0. Otherwise, we unlock the pages in an allocated region after creating an ordered extent. Considering the callers of cow_file_range(unlock=0) won't write out the pages, we can unlock the pages on error exit from cow_file_range(). So, we can ensure all the pages except @locked_page are unlocked on error case. In summary, cow_file_range now behaves like this: - page_started == 1 (return value) - All the pages are unlocked. IO is started. - unlock == 1 - All the pages except @locked_page are unlocked in any case - unlock == 0 - On success, all the pages are locked for writing out them - On failure, all the pages except @locked_page are unlocked Fixes: `42c0110009` ("btrfs: zoned: introduce dedicated data write path for zoned filesystems") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Ioannis Angelakopoulos	140a8ff765	btrfs: sysfs: export commit stats Export commit stats in file /sys/fs/btrfs/UUID/commit_stats with example output like: commits 123 last_commit_ms 11 max_commit_ms 150 total_commit_ms 2000 The values are in one file so reading them at a single time will give a more consistent view. The stats are internally tracked in nanoseconds so the cumulative values should not suffer from rounding errors. Writing 0 to the file 'commit_stats' will reset max_commit_ms. Initial values are set at first mount of the filesystem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Ioannis Angelakopoulos	e55958c8a0	btrfs: collect commit stats, count, duration Track several stats about transaction commit, to be later exported via sysfs: - number of commits so far - duration of the last commit in ns - maximum commit duration seen so far in ns - total duration for all commits so far in ns The update of the commit stats occurs after the commit thread has gone through all the logic that checks if there is another thread committing at the same time. This means that we only account for actual commit work in the commit stats we report and not the time the thread spends waiting until it is ready to do the commit work. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Christoph Hellwig	f3e90c1ca9	btrfs: remove extent writepage address space operation Same as in commit `21b4ee7029` ("xfs: drop ->writepage completely"): we can remove the callback as it's only used in one place - single page writeback from memory reclaim and is not called for cgroup writeback at all. We only allow such writeback from kswapd, not from direct memory reclaim, and so it is rarely used. When it comes from kswapd, it is effectively random dirty page shoot-down, which is horrible for IO patterns. We can rely on background writeback to clean all dirty pages in an efficient way and not let it be interrupted by kswapd. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	9555e1f188	btrfs: send: use boolean types for current inode status The new, new_gen and deleted indicate a status, use boolean type instead of int. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	cec3dad943	btrfs: send: remove old TODO regarding ERESTARTSYS The whole send operation is restartable and handling properly a buffer write may not be easy. We can't know what caused that and if a short delay and retry will fix it or how many retries should be performed in case it's a temporary condition. The error value is returned to the ioctl caller so in case it's transient problem, the user would be notified about the reason. Remove the TODO note as there's no plan to handle ERESTARTSYS. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	8234d3f658	btrfs: send: simplify includes We don't need the whole ctree.h in send.h, none of the data types defined there are used. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	e3b4b9040b	btrfs: send: drop __KERNEL__ ifdef from send.h We don't need this ifdef as the header file is not shared, the protocol definition used by userspace should be from libbtrfs or libbtrfsutil. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Christoph Hellwig	ee5b46a353	btrfs: increase direct io read size limit to 256 sectors Btrfs currently limits direct I/O reads to a single sector, which goes back to commit `c329861da4` ("Btrfs: don't allocate a separate csums array for direct reads") from Josef. That commit changes the direct I/O code to ".. use the private part of the io_tree for our csums.", but ten years later that isn't how checksums for direct reads work, instead they use a csums allocation on a per-btrfs_dio_private basis (which have their own performance problem for small I/O, but that will be addressed later). There is no fundamental limit in btrfs itself to limit the I/O size except for the size of the checksum array that scales linearly with the number of sectors in an I/O. Pick a somewhat arbitrary limit of 256 limits, which matches what the buffered reads typically see as the upper limit as the limit for direct I/O as well. This significantly improves direct read performance. For example a fio run doing 1 MiB aio reads with a queue depth of 1 roughly triples the throughput: Baseline: READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec With this patch: READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Qu Wenruo	f6065f8ede	btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() [BUG] There is a small workload which will always fail with recent kernel: (A simplified version from btrfs/125 test case) mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1 sync umount $mnt btrfs dev scan -u $dev3 mount -o degraded $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2 umount $mnt btrfs dev scan mount $dev1 $mnt btrfs balance start --full-balance $mnt umount $mnt The failure is always failed to read some tree blocks: BTRFS info (device dm-4): relocating block group 217710592 flags data\|raid5 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 ... [CAUSE] With the recently added debug output, we can see all RAID56 operations related to full stripe 38928384: 56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384 56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384 56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384 56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384 56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384 56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 Before we enter raid56_parity_recover(), we have triggered some metadata write for the full stripe 38928384, this leads to us to read all the sectors from disk. Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to avoid unnecessary read. This means, for that full stripe, after any partial write, we will have stale data, along with P/Q calculated using that stale data. Thankfully due to patch "btrfs: only write the sectors in the vertical stripe which has data stripes" we haven't submitted all the corrupted P/Q to disk. When we really need to recover certain range, aka in raid56_parity_recover(), we will use the cached rbio, along with its cached sectors (the full stripe is all cached). This explains why we have no event raid56_scrub_read_recover() triggered. Since we have the cached P/Q which is calculated using the stale data, the recovered one will just be stale. In our particular test case, it will always return the same incorrect metadata, thus causing the same error message "parent transid verify failed on 39010304 wanted 9 found 7" again and again. [BTRFS DESTRUCTIVE RMW PROBLEM] Test case btrfs/125 (and above workload) always has its trouble with the destructive read-modify-write (RMW) cycle: 0 32K 64K Data1: \| Good \| Good \| Data2: \| Bad \| Bad \| Parity: \| Good \| Good \| In above case, if we trigger any write into Data1, we will use the bad data in Data2 to re-generate parity, killing the only chance to recovery Data2, thus Data2 is lost forever. This destructive RMW cycle is not specific to btrfs RAID56, but there are some btrfs specific behaviors making the case even worse: - Btrfs will cache sectors for unrelated vertical stripes. In above example, if we're only writing into 0~32K range, btrfs will still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2. This behavior is to cache sectors for later update. Incidentally commit `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") has a bug which makes RAID56 to never trust the cached sectors, thus slightly improve the situation for recovery. Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in steal_rbio" will revert the behavior back to the old one. - Btrfs raid56 partial write will update all P/Q sectors and cache them This means, even if data at (64K ~ 96K) of Data2 is free space, and only (96K ~ 128K) of Data2 is really stale data. And we write into that (96K ~ 128K), we will update all the parity sectors for the full stripe. This unnecessary behavior will completely kill the chance of recovery. Thankfully, an unrelated optimization "btrfs: only write the sectors in the vertical stripe which has data stripes" will prevent submitting the write bio for untouched vertical sectors. That optimization will keep the on-disk P/Q untouched for a chance for later recovery. [FIX] Although we have no good way to completely fix the destructive RMW (unless we go full scrub for each partial write), we can still limit the damage. With patch "btrfs: only write the sectors in the vertical stripe which has data stripes" now we won't really submit the P/Q of unrelated vertical stripes, so the on-disk P/Q should still be fine. Now we really need to do is just drop all the cached sectors when doing recovery. By this, we have a chance to read the original P/Q from disk, and have a chance to recover the stale data, while still keep the cache to speed up regular write path. In fact, just dropping all the cache for recovery path is good enough to allow the test case btrfs/125 along with the small script to pass reliably. The lack of metadata write after the degraded mount, and forced metadata COW is saving us this time. So this patch will fix the behavior by not trust any cache in __raid56_parity_recover(), to solve the problem while still keep the cache useful. But please note that this test pass DOES NOT mean we have solved the destructive RMW problem, we just do better damage control a little better. Related patches: - btrfs: only write the sectors in the vertical stripe - `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") - btrfs: update stripe_sectors::uptodate in steal_rbio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Christoph Hellwig	711f447b4f	btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished finish_func is always set to finish_ordered_fn, so remove it and also the now pointless and somewhat confusingly named __endio_write_update_ordered wrapper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Nikolay Borisov	1f4f639fe7	btrfs: batch up release of reserved metadata for delayed items used for deletion With Filipe's recent rework of the delayed inode code one aspect which isn't batched is the release of the reserved metadata of delayed inode's delete items. With this patch on top of Filipe's rework and running the same test as provided in the description of a patch titled "btrfs: improve batch deletion of delayed dir index items" I observe the following change of the number of calls to btrfs_block_rsv_release: Before this change: - block_rsv_release: 1004 - btrfs_delete_delayed_items_total_time: 14602 - delete_batches: 505 After: - block_rsv_release: 510 - btrfs_delete_delayed_items_total_time: 13643 - delete_batches: 507 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Qu Wenruo	3613249a1b	btrfs: warn about dev extents that are inside the reserved range Btrfs on-disk format has reserved the first 1MiB for the primary super block (at 64KiB offset) and bootloaders may also use this space. This behavior is only introduced since v4.1 btrfs-progs release, although kernel can ensure we never touch the reserved range of super blocks, it's better to inform the end users, and a balance will resolve the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> [ update changelog and message ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	37f85ec320	btrfs: use named constant for reserved device space There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	bfceac7fd3	btrfs: remove unused typedefs get_extent_t and btrfs_work_func_t Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	e3059ec06b	btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino There's only one function we pass to iterate_inodes_from_logical as iterator, so we can drop the indirection and call it directly, after moving the function to backref.c Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	875d1daa7b	btrfs: simplify parameters of backref iterators The inode reference iterator interface takes parameters that are derived from the context parameter, but as it's a void* type the values are passed individually. Change the ctx type to inode_fs_path as it's the only thing we pass and drop any parameters that are derived from that. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	ad6240f662	btrfs: call inode_to_path directly and drop indirection The functions for iterating inode reference take a function parameter but there's only one value, inode_to_path(). Remove the indirection and call the function. As paths_from_inode would become just an alias for iterate_irefs(), merge the two into one function. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	6d322b4839	btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies() For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies directly, only for RAID5 and RAID6 we need some extra handling as there's no table value for that. For RAID10 there's a change from sub_stripes to ncopies. The values are the same but semantically we want to use number of copies, as this is what btrfs_num_copies does. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	0b30f71945	btrfs: use btrfs_raid_array to calculate number of parity stripes Use the raid table instead of hard coded values and rename the helper as it is exported. This could make later extension on RAID56 based profiles easier. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	6dead96c1a	btrfs: use btrfs_chunk_max_errors() to replace tolerance calculation In __btrfs_map_block() we have an assignment to @max_errors using nr_parity_stripes(). Although it works for RAID56 it's confusing. Replace it with btrfs_chunk_max_errors(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	bc88b486d5	btrfs: remove parameter dev_extent_len from scrub_stripe() For scrub_stripe() we can easily calculate the dev extent length as we have the full info of the chunk. Thus there is no need to pass @dev_extent_len from the caller, and we introduce a helper, btrfs_calc_stripe_length(), to do the calculation from extent_map structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	9db33891c7	btrfs: unify tree search helper returning prev and next nodes Simplify helper to return only next and prev pointers, we don't need all the node/parent/prev/next pointers of __etree_search as there are now other specialized helpers. Rename parameters so they follow the naming. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	ec60c76f53	btrfs: make tree search for insert more generic and use it for tree_search With a slight extension of tree_search_for_insert (fill the return node and parent return parameters) we can avoid calling __etree_search from tree_search, that could be removed eventually in followup patches. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	bebb22c13d	btrfs: open code inexact rbtree search in tree_search The call chain from tree_search tree_search_for_insert __etree_search can be open coded and allow further simplifications, here we need a tree search with fallback to the next node in case it's not found. This is represented as __etree_search parameters next_ret=valid, prev_ret=NULL. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	c367602a78	btrfs: remove node and parent parameters from insert_state There's no caller left that would pass valid pointers to insert_state so we can drop them. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	fb8f07d2d8	btrfs: add fast path for extent_state insertion In two cases the exact location where to insert the extent state is known at the call time so we don't need to pass it to insert_state that takes the fast path. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	6d92b304ec	btrfs: pass bits by value not by pointer for extent_state helpers The bits are passed to all extent state helpers for no apparent reason, the value only read and never updated so remove the indirection and pass it directly. Also unify the type to u32 where needed. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	cee5126825	btrfs: lift start and end parameters to callers of insert_state Let callers of insert_state to set up the extent state to allow further simplifications of the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	c7e118cf98	btrfs: open code rbtree search in insert_state The rbtree search is a known pattern and can be open coded, allowing to remove the tree_insert and further cleanups. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	12c9cdda62	btrfs: open code rbtree search in split_state Preparatory work to remove tree_insert from extent_io.c, the rbtree search loop is a known and simple so it can be open coded. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	1c10702e7c	btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe() Originally it's iterating all the sectors which has dbitmap sector for the vertical stripe. It can be easily converted to sector bytenr iteration with an test_bit() call. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	550cdeb3e0	btrfs: raid56: avoid double for loop inside raid56_rmw_stripe() This function doesn't even utilize full stripe skip, just iterate all the data sectors is definitely enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	aee35e4bcc	btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages() The double loop is just checking if the page for the vertical stripe is allocated. We can easily convert it to single loop and get rid of @stripe variable. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	ef340fccbe	btrfs: raid56: avoid double for loop inside __raid56_parity_recover() The double for loop can be easily converted to single for loop as we're really iterating the sectors in their bytenr order. The only exception is the full stripe skip, however that can also easily be done inside the loop. Add an ASSERT() along with a comment for that specific case. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	3692004465	btrfs: raid56: avoid double for loop inside finish_rmw() We can easily calculate the stripe number and sector number inside the stripe. Thus there is not much need for a double for loop. For the only case we want to skip the whole stripe, we can manually increase @total_sector_nr. This is not a recommended behavior, thus every time the iterator gets modified there will be a comment along with an ASSERT() for it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Josef Bacik	f31f09f6be	btrfs: tree-log: make the return value for log syncing consistent Currently we will return 1 or -EAGAIN if we decide we need to commit the transaction rather than sync the log. In practice this doesn't really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to commit the transaction. However this makes it hard to figure out what the correct thing to do is. Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the places where we want to force the transaction to be committed. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Johannes Thumshirn	5bea250881	btrfs: add tracepoints for ordered extents When debugging a reference counting issue with ordered extents, I've found we're lacking a lot of tracepoint coverage in the ordered extent code. Close these gaps by adding tracepoints after every refcount_inc() in the ordered extent code. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
David Sterba	15dcccdb8b	btrfs: sysfs: advertise zoned support among features We've hidden the zoned support in sysfs under debug config for the first releases but now the stability is reasonable, though not all features have been implemented. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	a4012f06f1	btrfs: split discard handling out of btrfs_map_block Mapping block for discard doesn't really share any code with the regular block mapping case. Split it out into an entirely separate helper that just returns an array of btrfs_discard_stripe structures and the number of stripes. This removes the need for the length field in the btrfs_io_context structure, so remove tht. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	5eecef7108	btrfs: stop looking at btrfs_bio->iter in index_one_bio All the bios that index_one_bio operates on are the bios submitted by the upper layer. These are never resubmitted to an actual device by the raid56 code, and thus the iter never changes from the initial state. Thus we can always just use bi_iter directly as it will be the same as the saved copy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Qu Wenruo	dc4d316849	btrfs: reject log replay if there is unsupported RO compat flag [BUG] If we have a btrfs image with dirty log, along with an unsupported RO compatible flag: log_root 30474240 ... compat_flags 0x0 compat_ro_flags 0x40000003 ( FREE_SPACE_TREE \| FREE_SPACE_TREE_VALID \| unknown flag: 0x40000000 ) Then even if we can only mount it RO, we will still cause metadata update for log replay: BTRFS info (device dm-1): flagging fs with big metadata feature BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): has skinny extents BTRFS info (device dm-1): start tree-log replay This is definitely against RO compact flag requirement. [CAUSE] RO compact flag only forces us to do RO mount, but we will still do log replay for plain RO mount. Thus this will result us to do log replay and update metadata. This can be very problematic for new RO compat flag, for example older kernel can not understand v2 cache, and if we allow metadata update on RO mount and invalidate/corrupt v2 cache. [FIX] Just reject the mount unless rescue=nologreplay is provided: BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead We don't want to set rescue=nologreply directly, as this would make the end user to read the old data, and cause confusion. Since the such case is really rare, we're mostly fine to just reject the mount with an error message, which also includes the proper workaround. CC: stable@vger.kernel.org #4.9+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Qu Wenruo	97f09d55f1	btrfs: make btrfs_super_block::log_root_transid deprecated When using "btrfs inspect-internal dump-super" to inspect an fs with dirty log, it always shows the log_root_transid as 0: log_root 30474240 log_root_transid 0 <<< log_root_level 0 It turns out that, btrfs_super_block::log_root_transid is never really utilized (even no read for it). This can date back to the introduction of btrfs into upstream kernel. In fact, when reading log tree root, we always use btrfs_super_block::generation + 1 as the expected generation. So here we're completely safe to mark this member deprecated. In theory we can easily reuse this member for other purposes, but to be extra safe, here we follow the leafsize way, by adding "__unused_" for log_root_transid. And we can safely remove the accessors, since there is no such callers from the very beginning. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	722c82ac9e	btrfs: pass the btrfs_bio_ctrl to submit_one_bio submit_one_bio always works on the bio and compression flags from a btrfs_bio_ctrl structure. Pass the explicitly and clean up the calling conventions by handling a NULL bio in submit_one_bio, and using the btrfs_bio_ctrl to pass the mirror number as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	9845e5ddcb	btrfs: merge end_write_bio and flush_write_bio Merge end_write_bio and flush_write_bio into a single submit_write_bio helper, that either submits the bio or ends it if a negative errno was passed in. This consolidates a lot of duplicated checks in the callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	2d5ac130fa	btrfs: don't use bio->bi_private to pass the inode to submit_one_bio submit_one_bio is only used for page cache I/O, so the inode can be trivially derived from the first page in the bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
David Sterba	234fdd2815	btrfs: remove redundant check in up check_setget_bounds There are two separate checks in the bounds checker, the first one being a special case of the second. As this function is performance critical due to checking access to any eb member, reducing the size can slightly improve performance. On a release build on x86_64 the helper is completely inlined so the function call overhead is also gone. There was a report of 5% performance drop on metadata heavy workload, that disappeared after disabling asserts. The most significant part of that is the bounds checker. https://lore.kernel.org/linux-btrfs/20200724164147.39925-1-josef@toxicpanda.com/ After the analysis, the optimized code removes the worst overhead which is the function call and the performance was restored. https://lore.kernel.org/linux-btrfs/20200730110943.GE3703@twin.jikos.cz/ 1. baseline, asserts on, setget check on run time: 46s run time with perf: 48s 2. asserts on, comment out setget check run time: 44s run time with perf: 47s So this is confirms the 5% difference 3. asserts on, optimized seget check run time: 44s run time with perf: 47s The optimizations are reducing the number of ifs to 1 and inlining the hot path. Low-level stuff, gets the performance back. Patch below. 4. asserts off, no setget check run time: 44s run time with perf: 45s This verifies that asserts other than the setget check have negligible impact on performance and it's not harmful to keep them on. Analysis where the performance is lost: * check_setget_bounds is short function, but it's still a function call, changing the flow of instructions and given how many times it's called the overhead adds up * there are two conditions, one to check if the range is completely outside (member_offset > eb->len) or partially inside (member_offset + size > eb->len) Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Fabio M. De Francesco	51c0674a56	btrfs: replace kmap() with kmap_local_page() in lzo.c The use of kmap() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the mappings are per thread and not globally visible. Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Fabio M. De Francesco	70826b6bd5	btrfs: replace kmap() with kmap_local_page() in inode.c The use of kmap() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the mappings are per thread and not globally visible. Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	9ff7ddd3c7	btrfs: do not allocate a btrfs_bio for low-level bios The bios submitted from btrfs_map_bio don't really interact with the rest of btrfs and the only btrfs_bio member actually used in the low-level bios is the pointer to the btrfs_io_context used for endio handler. Use a union in struct btrfs_io_stripe that allows the endio handler to find the btrfs_io_context and remove the spurious ->device assignment so that a plain fs_bio_set bio can be used for the low-level bios allocated inside btrfs_map_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	a316a25991	btrfs: factor stripe submission logic out of btrfs_map_bio Move all per-stripe handling into submit_stripe_bio and use a label to cleanup instead of duplicating the logic. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	d7b9416fe5	btrfs: remove btrfs_end_io_wq All reads bio that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workloads. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based a on typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	08a6f46434	btrfs: centralize setting REQ_META Set REQ_META in btrfs_submit_metadata_bio instead of the various callers. We'll start relying on this flag inside of btrfs in a bit, and this ensures it is always set correctly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	fed8a72df1	btrfs: don't use btrfs_bio_wq_end_io for compressed writes Compressed write bio completion is the only user of btrfs_bio_wq_end_io for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal here as we only real need user context for the final completion of a compressed_bio structure, and not every single bio completion. Add a work_struct to struct compressed_bio instead and use that to call finish_compressed_bio_write. This allows to remove all handling of write bios in the btrfs_bio_wq_end_io infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	02bb5b7247	btrfs: don't double-defer bio completions for compressed reads The bio completion handler of the bio used for the compressed data is already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule the completion of the original bio to the same workqueue again but just execute it directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	d34e123de1	btrfs: defer I/O completion based on the btrfs_raid_bio Instead of attaching an extra allocation an indirect call to each low-level bio issued by the RAID code, add a work_struct to struct btrfs_raid_bio and only defer the per-rbio completion action. The per-bio action for all the I/Os are trivial and can be safely done from interrupt context. As a nice side effect this also allows sharing the boilerplate code for the per-bio completions Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	c93104e758	btrfs: split btrfs_submit_data_bio to read and write parts Split btrfs_submit_data_bio into one helper for reads and one for writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	e6484bd488	btrfs: simplify code flow in btrfs_submit_dio_bio There is no exit block and cleanup and the function is reasonably short so we can use inline return and not the goto. This makes the function more straight forward. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	b4c46bdea9	btrfs: move more work into btrfs_end_bioc Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of duplicating the logic in the callers. Also remove the bio argument as it always must be bioc->orig_bio and the now pointless bioc_error that did nothing but assign bi_sector to the same value just sampled in the caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	d681559280	btrfs: send: enable support for stream v2 and compressed writes Now that the new support is implemented, allow the ioctl to accept v2 and the compressed flag, and update the version in sysfs. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	3ea4dc5bf0	btrfs: send: send compressed extents with encoded writes Now that all of the pieces are in place, we can use the ENCODED_WRITE command to send compressed extents when appropriate. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	a4b333f227	btrfs: send: get send buffer pages for protocol v2 For encoded writes in send v2, we will get the encoded data with btrfs_encoded_read_regular_fill_pages(), which expects a list of raw pages. To avoid extra buffers and copies, we should read directly into the send buffer. Therefore, we need the raw pages for the send buffer. We currently allocate the send buffer with kvmalloc(), which may return a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page(). However, the buffer size we use (144K) is not a power of two, which in theory is not guaranteed to return a page-aligned buffer, and in practice would waste a lot of memory due to rounding up to the next power of two. 144K is large enough that it usually gets allocated with vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc() and save the pages in an array. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	356bbbb66b	btrfs: send: write larger chunks when using stream v2 The length field of the send stream TLV header is 16 bits. This means that the maximum amount of data that can be sent for one write is 64K minus one. However, encoded writes must be able to send the maximum compressed extent (128K) in one command, or more. To support this, send stream version 2 encodes the DATA attribute differently: it has no length field, and the length is implicitly up to the end of containing command (which has a 32bit length field). Although this is necessary for encoded writes, normal writes can benefit from it, too. Also add a check to enforce that the DATA attribute is last. It is only strictly necessary for v2, but we might as well make v1 consistent with it. For v2, let's bump up the send buffer to the maximum compressed extent size plus 16K for the other metadata (144K total). Since this will most likely be vmalloc'd (and always will be after the next commit), we round it up to the next page since we might as well use the rest of the page on systems with >16K pages. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	b7c14f23fb	btrfs: send: add stream v2 definitions This adds the definitions of the new commands for send stream version 2 and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a. chattr), and encoded writes. It also documents two changes to the send stream format in v2: the receiver shouldn't assume a maximum command size, and the DATA attribute is encoded differently to allow for writes larger than 64k. These will be implemented in subsequent changes, and then the ioctl will accept the new version and flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	54cab6aff8	btrfs: send: explicitly number commands and attributes Commit `e77fbf9903` ("btrfs: send: prepare for v2 protocol") added _BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the version plus 1, but as written this creates gaps in the number space. The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2 has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that 23 and 24 are valid commands. Instead, let's explicitly number all of the commands, attributes, and sentinel MAX constants. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	ca182acc53	btrfs: send: remove unused send_ctx::{total,cmd}_send_size We collect these statistics but have never exposed them in any way. I also didn't find any patches that ever attempted to make use of them. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Stefan Roesch	22c55e3bbb	btrfs: sysfs: add force_chunk_alloc trigger to force allocation Adds write-only trigger to force new chunk allocation for a given block group type. It is at /sys/fs/btrfs/<uuid>/allocation/<type>/force_chunk_alloc Note: this is now only for debugging and testing and is enabled with the CONFIG_BTRFS_DEBUG configuration option. The transaction is started from sysfs context and can be problematic in some cases. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ Changes from the original submission: - update changelog - drop unnecessary error messages - switch value to bool and use kstrtobool - move BTRFS_ATTR_W definition - add comment for using transaction ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Stefan Roesch	19fc516a51	btrfs: sysfs: export chunk size in space infos Add new sysfs knob /sys/fs/btrfs/<uuid>/allocation/<type>/chunk_size. This allows to query the chunk size and also set the chunk size. Constraints: - can be changed by root only - system chunk size can't be set - maximum chunk size is 10% of the filesystem size - final value is rounded down to a multiple of 256M - cannot be set on zoned filesystem Note, that rounding and the 10% clamp will result to a different value on filesystems smaller than 10G, typically 768M. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ Changes to original submission: - document setting constraints - drop read-only requirement - drop unnecessary error messages - fix return values of _store callback - use memparse for the value - fix rounding down to 256M ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Stefan Roesch	f6fca3917b	btrfs: store chunk size in space-info struct The chunk size is stored in the btrfs_space_info structure. It is initialized at the start and is then used. A new API is added to update the current chunk size. This API is used to be able to expose the chunk_size as a sysfs setting. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename and merge helpers, switch atomic type to u64, style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Josef Bacik	71b68e9e35	btrfs: do not batch insert non-consecutive dir indexes during log replay While running generic/475 in a loop I got the following error BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524) <snip> item 65 key (263 96 517) itemoff 14132 itemsize 33 item 66 key (263 96 523) itemoff 14099 itemsize 33 item 67 key (263 96 525) itemoff 14066 itemsize 33 item 68 key (263 96 531) itemoff 14033 itemsize 33 item 69 key (263 96 524) itemoff 14000 itemsize 33 As you can see here we have 3 dir index keys with the dir index value of 523, 524, and 525 inserted between 517 and 524. This occurs because our dir index insertion code will bulk insert all dir index items on the node regardless of their actual key value. This makes sense on a normally running system, because if there's a gap in between the items there was a deletion before the item was inserted, so there's not going to be an overlap of the dir index items that need to be inserted and what exists on disk. However during log replay this isn't necessarily true, we could have any number of dir indexes in the tree already. Fix this by seeing if we're replaying the log, and if we are simply skip batching if there's a gap in the key space. This file system was left broken from the fstest, I tested this patch against the broken fs to make sure it replayed the log properly, and then btrfs checked the file system after the log replay to verify everything was ok. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:14 +02:00
Filipe Manana	763748b238	btrfs: reduce amount of reserved metadata for delayed item insertion Whenever we want to create a new dir index item (when creating an inode, create a hard link, rename a file) we reserve 1 unit of metadata space for it in a transaction (that's 256K for a node/leaf size of 16K), and then create a delayed insertion item for it to be added later to the subvolume's tree. That unit of metadata is kept until the delayed item is inserted into the subvolume tree, which may take a while to happen (in the worst case, it's done only when the transaction commits). If we have multiple dir index items to insert for the same directory, say N index items, and they all fit in a single leaf of metadata, then we are holding N units of reserved metadata space when all we need is 1 unit. This change addresses that, whenever a new delayed dir index item is added, we release the unit of metadata the caller has reserved when it started the transaction if adding that new dir index item does not result in touching one more metadata leaf, otherwise the reservation is kept by transferring it from the transaction block reserve to the delayed items block reserve, just like before. Given that with a leaf size of 16K we can have a few hundred dir index items in a single leaf (the exact value depends on file name lengths), this reduces pressure on metadata reservation by releasing unnecessary space much sooner. The following fs_mark test showed some improvement when creating many files in parallel on machine running a non debug kernel (debian's default kernel config) with 12 cores: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" FILES=100000 THREADS=$(nproc --all) echo "performance" \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before: FSUse% Count Size Files/sec App Overhead 2 1200000 0 225991.3 5465891 4 2400000 0 345728.1 5512106 4 3600000 0 346959.5 5557653 8 4800000 0 329643.0 5587548 8 6000000 0 312657.4 5606717 8 7200000 0 281707.5 5727985 12 8400000 0 88309.8 5020422 12 9600000 0 85835.9 5207496 16 10800000 0 81039.2 5404964 16 12000000 0 58548.6 5842468 After: FSUse% Count Size Files/sec App Overhead 2 1200000 0 230604.5 5778375 4 2400000 0 348908.3 5508072 4 3600000 0 357028.7 5484337 6 4800000 0 342898.3 5565703 6 6000000 0 314670.8 5751555 8 7200000 0 282548.2 5778177 12 8400000 0 90844.9 5306819 12 9600000 0 86963.1 5304689 16 10800000 0 89113.2 5455248 16 12000000 0 86693.5 5518933 The "after" results are after applying this patch and all the other patches in the same patchset, which is comprised of the following changes: btrfs: balance btree dirty pages and delayed items after a rename btrfs: free the path earlier when creating a new inode btrfs: balance btree dirty pages and delayed items after clone and dedupe btrfs: add assertions when deleting batches of delayed items btrfs: deal with deletion errors when deleting delayed items btrfs: refactor the delayed item deletion entry point btrfs: improve batch deletion of delayed dir index items btrfs: assert that delayed item is a dir index item when adding it btrfs: improve batch insertion of delayed dir index items btrfs: do not BUG_ON() on failure to reserve metadata for delayed item btrfs: set delayed item type when initializing it btrfs: reduce amount of reserved metadata for delayed item insertion Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:36 +02:00
Filipe Manana	c9d02ab4b4	btrfs: set delayed item type when initializing it Currently we set the type of a delayed item only after successfully inserting it into its respective rbtree. This is fine, as the type is not used anywhere before that point, but for the next patch in the series, there will be the need to check the type of a delayed item before inserting it into a rbtree. So set the type of a delayed item immediately after allocating it. This also makes the trivial wrappers for adding insertion and deletion useless, so it removes them as well. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:36 +02:00
Filipe Manana	3bae13e9d4	btrfs: do not BUG_ON() on failure to reserve metadata for delayed item At btrfs_insert_delayed_dir_index(), we don't expect the metadata reservation for the delayed dir index item insertion to fail, because the caller is supposed to have reserved 1 unit of metadata space for that. All callers are able to deal with an error in case that happens, so there is no need for something so drastic as a BUG_ON() in case of failure. Instead just emit a warning, so that's easily noticed during development (fstests in particular), and return the error to the caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	06ac264f3f	btrfs: improve batch insertion of delayed dir index items Currently we group delayed dir index items for insertion as a single batch (a single btree operation) as long as their keys are sequential in the key space. For example we have delayed index items for the following index keys: 10, 11, 12, 15, 16, 20, 21 We end up building three batches: 1) First one for index keys 10, 11 and 12; 2) Second one for index keys 15 and 16; 3) Third one for index keys 20 and 21. However, since the dir index numbers come from a monotonically increasing counter and are never reused, we could group all these items into a single batch. The existence of holes in the sequence happens only when we had delayed dir index items for insertion that got deleted before they were flushed to the subvolume's tree. The delayed items are stored in a rbtree based on their key order, so we can just group items into a batch as long as they all fit in a leaf, and ignore if there's a gap (key offset, index number) between two consecutive items. This is more efficient and reduces the amount of time spent when running delayed items if there are gaps between dir index items. For example running the following test script: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT NUM_FILES=100 mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Now delete every other file, to create gaps in the dir index keys. for ((i = 1; i <= $NUM_FILES; i += 2)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) sync end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nsync took $dur milliseconds" umount $MNT While having the following bpftrace script running in another shell: $ cat bpf-delayed-items-inserts.sh #!/usr/bin/bpftrace /* Must add 'noinline' to btrfs_insert_delayed_items(). */ k:btrfs_insert_delayed_items { @start_insert_delayed_items[tid] = nsecs; } k:btrfs_insert_empty_items /@start_insert_delayed_items[tid]/ { @insert_batches = count(); } kr:btrfs_insert_delayed_items /@start_insert_delayed_items[tid]/ { $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000; @btrfs_insert_delayed_items_total_time = sum($dur); delete(@start_insert_delayed_items[tid]); } Before this change: @btrfs_insert_delayed_items_total_time: 576 @insert_batches: 51 After this change: @btrfs_insert_delayed_items_total_time: 174 @insert_batches: 2 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	a176affe54	btrfs: assert that delayed item is a dir index item when adding it All delayed items are for dir index items, we don't support any other item types at the moment. So simplify __btrfs_add_delayed_item() and add an assertion for checking the item's key type. This also allows the next change to be simpler and avoid to check key types. In case we add support for different item types in the future, then we'll hit the assertion during development and be able to adjust any code that is assuming delayed items are always associated to dir index items. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	4bd02d9012	btrfs: improve batch deletion of delayed dir index items Currently we group delayed dir index items for deletion in a single batch (single btree operation) as long as they all exist in the same leaf and as long as their keys are sequential in the key space. For example if we have a leaf that has dir index items with offsets: 2, 3, 4, 6, 7, 10 And we have delayed dir index items for deleting all these indexes, and no delayed items for any other index keys in between, then we end up deleting in 3 batches: 1) First batch for indexes 2, 3 and 4; 2) Second batch for indexes 6 and 7; 3) Third batch for index 10. This is a waste because we can delete all the index keys in a single batch. What matters is that each consecutive delayed index key matches each consecutive dir index key in a leaf. So update the logic at btrfs_batch_delete_items() to check only for a key match between delayed dir index items and dir index items in a leaf. Also avoid the useless first iteration on comparing the key of the first slot to delete with the key of the first delayed item, as it's silly since they always match, as the delayed item's key was used for the btree search that gave us the path we have. This is more efficient and reduces runtime of running delayed items, as well as lock contention on the subvolume's tree. For example, the following test script: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT NUM_FILES=1000 mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Now delete every other file, to create gaps in the dir index keys. for ((i = 1; i <= $NUM_FILES; i += 2)); do rm -f $MNT/testdir/file_$i done # Sync to force any delayed items to be flushed to the tree. sync start=$(date +%s%N) rm -fr $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nrm -fr took $dur milliseconds" umount $MNT Running that test script while having the following bpftrace script running in another shell: $ cat bpf-measure.sh #!/usr/bin/bpftrace /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */ k:btrfs_delete_delayed_items { @start_delete_delayed_items[tid] = nsecs; } k:btrfs_del_items /@start_delete_delayed_items[tid]/ { @delete_batches = count(); } kr:btrfs_delete_delayed_items /@start_delete_delayed_items[tid]/ { $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000; @btrfs_delete_delayed_items_total_time = sum($dur); delete(@start_delete_delayed_items[tid]); } Before this change: @btrfs_delete_delayed_items_total_time: 9563 @delete_batches: 1001 After this change: @btrfs_delete_delayed_items_total_time: 7328 @delete_batches: 509 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	36baa2c751	btrfs: refactor the delayed item deletion entry point The delayed item deletion entry point, btrfs_delete_delayed_items(), is a bit convoluted for a few reasons: 1) It's really a loop disguised with labels and goto statements; 2) There's a 'delete_fail' label which isn't only for error cases, we can jump to that label even if no error happened, if we simply don't have more delayed items to delete; 3) Unnecessarily keeps track of the current and previous items for no good reason, as after getting the next item and releasing the current one, it just jumps to the 'again' label just to look again for the first delayed item; 4) When a delayed item is not in the tree (because it was already deleted before), it releases the item while holding a path locked, which is not necessary and adds more contention to the tree, specially taking into account that the path came from a deletion search, meaning we have write locks for nodes at levels 2, 1 and 0. And releasing the item is not computationally trivial (rb tree deletion, a kfree() and some trivial things). So refactor it to use a while loop and add some comments to make it more obvious why we can have delayed items without a matching item in the tree as well as why not keep the delayed node locked all the time when running all its deletion items. This is also a preparation for some upcoming work involving delayed items. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	2b1d260de1	btrfs: deal with deletion errors when deleting delayed items Currently, btrfs_delete_delayed_items() ignores any errors returned from btrfs_batch_delete_items(). This looks fishy but it's not a problem at the moment because: 1) Two of the errors returned from btrfs_batch_delete_items() are for impossible cases, cases where a delayed item does not match any item in the leaf the path points to - btrfs_delete_delayed_items() always calls btrfs_batch_delete_items() with a path that points to a leaf that contains an item matching a delayed item; 2) btrfs_batch_delete_items() may return an error from btrfs_del_items(), in which case it does not release the delayed items of the batch. At the moment this is harmless because btrfs_del_items() actually is always able to delete items, even if it returns an error - when it returns an error it's because it ended up with a leaf mostly empty (less than 1/3 full) and failed to migrate items from that leaf into its neighbour leaves - this is not critical, as all the items were deleted, we just left the tree a bit unbalanced, but it's still a valid tree and causes no harm, and future operations on the tree will eventually balance it. So even if we get an error from btrfs_del_items(), the delayed items will not be released but the next time we run delayed items we will find out, at btrfs_delete_delayed_items(), that they are not present in the tree anymore and then release them. This is all a bit subtle, and it's certainly prone to be a disaster in case btrfs_del_items() changes one day and may return errors before being able to delete all the requested items, in which case we could leave the filesystem in an inconsistent state as we would commit a transaction despite a failure from deleting items from the tree. So make btrfs_delete_delayed_items() check for any errors from the call to btrfs_batch_delete_items(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	659192e668	btrfs: add assertions when deleting batches of delayed items There are a few impossible cases that btrfs_batch_delete_items() tries to deal with: 1) Getting a path pointing to a NULL leaf; 2) The leaf slot is pointing beyond the last item in the leaf; 3) We can't find a single item to delete. The first case is impossible because the given path was returned by a successful call to btrfs_search_slot(). Replace the BUG_ON() with an ASSERT for this. The second case is impossible because we are always called when a delayed item matches an item in the given leaf. So add an ASSERT() for that and if that condition is not satisfied, trigger a warning and return an error. The third case is impossible exactly because of the same reason as the second case. The given delayed item matches one item in the leaf, so we know that our batch always has at least one item. Add an ASSERT to check that, trigger a warning if that expectation fails and return an error. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	6fe81a3a3a	btrfs: balance btree dirty pages and delayed items after clone and dedupe When reflinking extents (clone and deduplication), we need to touch the btree of the destination inode's subvolume, as well as potentially create a delayed inode for the destination inode (if it was not created before). However we are neither balancing the btree dirty pages nor the delayed items after such operations, so if we have a task that is doing a long series of clone or deduplication operations, it can result in accumulation of too many btree dirty pages and delayed items. So just call btrfs_btree_balance_dirty() after clone and deduplication, just like we do for every other system call that results on modifying a btree and adding delayed items. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	814e77182b	btrfs: free the path earlier when creating a new inode When creating an inode, through btrfs_create_new_inode(), we release the path we allocated before once we don't need it anymore. But we keep it allocated until we return from that function, which is wasteful because after we release the path we do several things that can allocate yet another path: inheriting properties, setting the xattrs used by ACLs and secutiry modules, adding an orphan item (O_TMPFILE case) or adding a dir item (for the non-O_TMPFILE case). So instead of releasing the path once we don't need it anymore, free it instead. This way we avoid having two paths allocated until we return from btrfs_create_new_inode(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	ca6dee6b79	btrfs: balance btree dirty pages and delayed items after a rename A rename operation modifies a subvolume's btree, to remove the old dir item, add the new dir item, remove an inode ref and add a new inode ref. It can also create the delayed inode for the inodes involved in the operation, and it creates two delayed dir index items, one to delete the old name and another one to add the new name. However we are neither balancing the btree dirty pages nor the delayed items after a rename, which can result in accumulation of too many btree dirty pages and delayed items, specially if a task is doing a series of rename operations (for example it can happen for package installations/upgrades through the zypper tool). So just call btrfs_btree_balance_dirty() after a rename, just like we do for every other system call that results on modifying a btree and adding delayed items. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Qu Wenruo	b8bea09a45	btrfs: add trace event for submitted RAID56 bio Add tracepoint for better insight to how the RAID56 data are submitted. The output looks like this: (trace event header and UUID skipped) raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768 raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536 raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768 raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768 The above debug output is from a 32K data write into an empty RAID56 data chunk. Some explanation on the event output: full_stripe: the logical bytenr of the full stripe devid: btrfs devid type: raid stripe type. DATA1: the first data stripe DATA2: the second data stripe PQ1: the P stripe PQ2: the Q stripe offset: the offset inside the stripe. opf: the bio op type physical: the physical offset the bio is for len: the length of the bio The first two lines are from partial RMW read, which is reading the remaining data stripes from disks. The last two lines are for full stripe RMW write, which is writing the involved two 16K stripes (one for DATA1 stripe, one for P stripe). The stripe for DATA2 doesn't need to be written. There are 5 types of trace events: - raid56_read_partial Read remaining data for regular read/write path. - raid56_write_stripe Write the modified stripes for regular read/write path. - raid56_scrub_read_recover Read remaining data for scrub recovery path. - raid56_scrub_write_stripe Write the modified stripes for scrub path. - raid56_scrub_read Read remaining data for scrub path. Also, since the trace events are included at super.c, we have to export needed structure definitions to 'raid56.h' and include the header in super.c, or we're unable to access those members. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	4d10046613	btrfs: update stripe_sectors::uptodate in steal_rbio [BUG] With added debugging, it turns out the following write sequence would cause extra read which is unnecessary: # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \ -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \ $mnt/file The debug message looks like this (btrfs header skipped): partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 ^^^^ Still partial read, even 389152768 is already cached by the first. write. full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 ^^^^ Still partial read for 298844160. full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768 This means every 32K writes, even they are in the same full stripe, still trigger read for previously cached data. This would cause extra RAID56 IO, making the btrfs raid56 cache useless. [CAUSE] Commit `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") tries to make steal_rbio() subpage compatible, but during that conversion, there is one thing missing. We no longer rely on PageUptodate(rbio->stripe_pages[i]), but rbio->stripe_nsectors[i].uptodate to determine if a sector is uptodate. This means, previously if we switch the pointer, everything is done, as the PageUptodate flag is still bound to that page. But now we have to manually mark the involved sectors uptodate, or later raid56_rmw_stripe() will find the stolen sector is not uptodate, and assemble the read bio for it, wasting IO. [FIX] We can easily fix the bug, by also update the rbio->stripe_sectors[].uptodate in steal_rbio(). With this fixed, now the same write pattern no longer leads to the same unnecessary read: partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768 ^^^ No more partial read, directly into the write path. full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768 Fixes: `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
David Sterba	21a8935ead	btrfs: remove redundant calls to flush_dcache_page Both memzero_page and memcpy_to_page already call flush_dcache_page so we can remove the calls from btrfs code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	bd8f7e6277	btrfs: only write the sectors in the vertical stripe which has data stripes If we have only 8K partial write at the beginning of a full RAID56 stripe, we will write the following contents: 0 8K 32K 64K Disk 1 (data): \|XX\| \| \| Disk 2 (data): \| \| \| Disk 3 (parity): \|XXXXXXXXXXXXXXX\|XXXXXXXXXXXXXXX\| \|X\| means the sector will be written back to disk. Note that, although we won't write any sectors from disk 2, but we will write the full 64KiB of parity to disk. This behavior is fine for now, but not for the future (especially for RAID56J, as we waste quite some space to journal the unused parity stripes). So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we queue a higher level bio into an rbio, we will update rbio::dbitmap to indicate which vertical stripes we need to writeback. And at finish_rmw(), we also check dbitmap to see if we need to write any sector in the vertical stripe. So after the patch, above example will only lead to the following writeback pattern: 0 8K 32K 64K Disk 1 (data): \|XX\| \| \| Disk 2 (data): \| \| \| Disk 3 (parity): \|XX\| \| \| Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	381b9b4c9c	btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap Previously we use "unsigned long " for those two bitmaps. But since we only support fixed stripe length (64KiB, already checked in tree-checker), "unsigned long " is really a waste of memory, while we can just use "unsigned long". This saves us 8 bytes in total for scrub_parity. To be extra safe, add an ASSERT() making sure calclulated @nsectors is always smaller than BITS_PER_LONG. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	c67c68eb57	btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap Previsouly we use "unsigned long " for those two bitmaps. But since we only support fixed stripe length (64KiB, already checked in tree-checker), "unsigned long " is really a waste of memory, while we can just use "unsigned long". This saves us 8 bytes in total for btrfs_raid_bio. To be extra safe, add an ASSERT() making sure calculated @stripe_nsectors is always smaller than BITS_PER_LONG. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Nikolay Borisov	099aa97213	btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance This eliminates 2 labels and makes the code generally more streamlined. Also rename the 'out_bargs' label to 'out_unlock' since bargs is going to be freed under the 'out' label. This also fixes a memory leak since bargs wasn't correctly freed in one of the condition which are now moved in btrfs_try_lock_balance. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Nikolay Borisov	7fb10ed89e	btrfs: introduce btrfs_try_lock_balance This function contains the factored out locking sequence of btrfs_ioctl_balance. Having this piece of code separate helps to simplify btrfs_ioctl_balance which has too complicated. This will be used in the next patch to streamline the logic in btrfs_ioctl_balance. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	1e87770cb3	btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio Use the new btrfs_bio_for_each_sector iterator to simplify btrfs_check_read_dio_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	261d812b04	btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks Add a helper that works similar to __bio_for_each_segment, but instead of iterating over PAGE_SIZE chunks it iterates over each sector. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: split from a larger patch, and iterate over the offset instead of the offset bits] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ add parameter comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	a89ce08ce6	btrfs: factor out a btrfs_csum_ptr helper Add a helper to find the csum for a byte offset into the csum buffer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	97861cd166	btrfs: refactor end_bio_extent_readpage code flow Untangle the goto and move the code it jumps to so it goes in the order of the most likely states first. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	a5aa7ab6e7	btrfs: factor out a helper to end a single sector buffer I/O Add a helper to end I/O on a single sector, which will come in handy with the new read repair code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Qu Wenruo	fd5a6f63cb	btrfs: remove duplicated parameters from submit_data_read_repair() The function submit_data_read_repair() is only called for buffered data read path, thus those members can be calculated using bvec directly: - start start = page_offset(bvec->bv_page) + bvec->bv_offset; - end end = start + bvec->bv_len - 1; - page page = bvec->bv_page; - pgoff pgoff = bvec->bv_offset; Thus we can safely replace those 4 parameters with just one bio_vec. Also remove the unused return value. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: also remove the return value] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Qu Wenruo	ae643a74eb	btrfs: introduce a data checksum checking helper Although we have several data csum verification code, we never have a function really just to verify checksum for one sector. Function check_data_csum() do extra work for error reporting, thus it requires a lot of extra things like file offset, bio_offset etc. Function btrfs_verify_data_csum() is even worse, it will utilize page checked flag, which means it can not be utilized for direct IO pages. Here we introduce a new helper, btrfs_check_sector_csum(), which really only accept a sector in page, and expected checksum pointer. We use this function to implement check_data_csum(), and export it for incoming patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: keep passing the csum array as an arguments, as the callers want to print it, rename per request] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Qu Wenruo	b036f47996	btrfs: quit early if the fs has no RAID56 support for raid56 related checks The following functions do special handling for RAID56 chunks: - btrfs_is_parity_mirror() Check if the range is in RAID56 chunks. - btrfs_full_stripe_len() Either return sectorsize for non-RAID56 profiles or full stripe length for RAID56 chunks. But if a filesystem without any RAID56 chunks, it will not have RAID56 incompat flags, and we can skip the chunk tree looking up completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Fanjun Kong	1280d2d165	btrfs: use PAGE_ALIGNED instead of IS_ALIGNED The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's use it instead of IS_ALIGNED and passing PAGE_SIZE directly. Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Pankaj Raghav	31f3726980	btrfs: zoned: fix comment description for sb_write_pointer logic Fix the comment to represent the actual logic used for sb_write_pointer - Empty[0] && In use[1] should be an invalid state instead of returning zone 0 wp - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp - In use[0] && Empty[1] should be returning zone 0 wp instead of being an invalid state - In use[0] && Full[1] should be returning zone 0 wp instead of returning zone 1 wp - Full[0] && Empty[1] should be returning zone 1 wp instead of returning zone 0 wp - Full[0] && In use[1] should be returning zone 1 wp instead of returning zone 0 wp Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
David Sterba	143823cf4d	btrfs: fix typos in comments Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Linus Torvalds	972a278fe6	for-5.19-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP 1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+ FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi 2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I= =/47A -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs reverts from David Sterba: "Due to a recent report [1] we need to revert the radix tree to xarray conversion patches. There's a problem with sleeping under spinlock, when xa_insert could allocate memory under pressure. We use GFP_NOFS so this is a real problem that we unfortunately did not discover during review. I'm sorry to do such change at rc6 time but the revert is IMO the safer option, there are patches to use mutex instead of the spin locks but that would need more testing. The revert branch has been tested on a few setups, all seem ok. The conversion to xarray will be revisited in the future" Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1] * tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: turn delayed_nodes_tree into an XArray" Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" Revert "btrfs: turn fs_info member buffer_radix into XArray" Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"	2022-07-16 13:48:55 -07:00
David Sterba	088aea3b97	Revert "btrfs: turn delayed_nodes_tree into an XArray" This reverts commit `253bf57555`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:15:19 +02:00
David Sterba	5b8418b843	Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" This reverts commit `4076942021`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:58 +02:00
David Sterba	01cd390903	Revert "btrfs: turn fs_info member buffer_radix into XArray" This reverts commit `8ee922689d`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:33 +02:00
David Sterba	fc7cbcd489	Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray" This reverts commit `48b36a602a`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:28 +02:00
Bart Van Assche	bf9486d6dd	fs/btrfs: Use the enum req_op and blk_opf_t types Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Linus Torvalds	5a29232d87	for-5.19-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLMiQQACgkQxWXV+ddt WDvBQg/+I1ebfW2DFY8kBwy7c1qKZWIhNx1VVk2AegIXvrW/Tos7wp5O6fi7p/jL d6k8zO/zFLlfiI4Ckmz3gt7cxaMTNXxr6+GQpNNm1b92Wdcy1a+3gquzcehT9Q10 ZB4ecPWzEDXgORvdBYG2eD2Z8PrsF0Wu88XRDiiJOBQLjZ+k2sVp8QvJlOllLDoC m7rPoq98jC6VpZwFJ+fGk2jC7y4+1QXrOuQMy7LRTe59Thp6wUFDDPtkKfr5scDC UxkctlUdInD7A6DVvPzwaBFNoT8UeEByGHcMd3KjjrTdmqSWW6k8FiF4ckZwA3zJ oPdJVzdC5a2W7t6BHw+t7VNmkKd+swnr2sVSGQ8eIzF7z3/JSqyYVwziOD1YzAdU QUmawWm4/SFvsbO8aoLrEKNbUiTgQwVbKzJh4Dhu9VJ43jeCwCX7pa/uZI4evgyG T0tuwm58bWCk4y1o1fcFYgf4JcVgK23F2vKckUFZeHoV3Q8R0DnPCCGTqs1qT5vY irZ9AIawmaR09JptMjjsAEjDA9qb16Ut/J6/anukyCgL610EyYZG7zb1WH1cUD1o zNXY6O/iKyNdiXj7V1fTMiG/M8hGDcFu4pOpBk3hFjHEXX9BefoVC0J5YzvCecPz isqboD5Lt1I4mrzac1X+serMYfVbFH6+tsEPBQBZf/o/a0u43jI= =Cxvn -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A more fixes that seem to me to be important enough to get merged before release: - in zoned mode, fix leak of a structure when reading zone info, this happens on normal path so this can be significant - in zoned mode, revert an optimization added in 5.19-rc1 to finish a zone when the capacity is full, but this is not reliable in all cases - try to avoid short reads for compressed data or inline files when it's a NOWAIT read, applications should handle that but there are two, qemu and mariadb, that are affected" * tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: drop optimization of zone finish btrfs: zoned: fix a leaked bioc in read_zone_info btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents	2022-07-11 14:41:44 -07:00
Naohiro Aota	b3a3b02557	btrfs: zoned: drop optimization of zone finish We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we wrote fully into the zone. The assumption is determined by "alloc_offset == capacity". This condition won't work if the last ordered extent is canceled due to some errors. In that case, we consider the zone is deactivated without sending the finish command while it's still active. This inconstancy results in activating another block group while we cannot really activate the underlying zone, which causes the active zone exceeds errors like below. BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0 nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0 nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0 Fix the issue by removing the optimization for now. Fixes: `8376d9e1ed` ("btrfs: zoned: finish superblock zone once no space left for new SB") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-08 19:18:00 +02:00
Christoph Hellwig	2963457829	btrfs: zoned: fix a leaked bioc in read_zone_info The bioc would leak on the normal completion path and also on the RAID56 check (but that one won't happen in practice due to the invalid combination with zoned mode). Fixes: `7db1c5d14d` ("btrfs: zoned: support dev-replace in zoned filesystems") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-08 19:13:32 +02:00
Filipe Manana	a4527e1853	btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents When doing a direct IO read or write, we always return -ENOTBLK when we find a compressed extent (or an inline extent) so that we fallback to buffered IO. This however is not ideal in case we are in a NOWAIT context (io_uring for example), because buffered IO can block and we currently have no support for NOWAIT semantics for buffered IO, so if we need to fallback to buffered IO we should first signal the caller that we may need to block by returning -EAGAIN instead. This behaviour can also result in short reads being returned to user space, which although it's not incorrect and user space should be able to deal with partial reads, it's somewhat surprising and even some popular applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't deal with short reads properly (or at all). The short read case happens when we try to read from a range that has a non-compressed and non-inline extent followed by a compressed extent. After having read the first extent, when we find the compressed extent we return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to treat the request as a short read, returning 0 (success) and waiting for previously submitted bios to complete (this happens at fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at btrfs_file_read_iter(), we call filemap_read() to use buffered IO to read the remaining data, and pass it the number of bytes we were able to read with direct IO. Than at filemap_read() if we get a page fault error when accessing the read buffer, we return a partial read instead of an -EFAULT error, because the number of bytes previously read is greater than zero. So fix this by returning -EAGAIN for NOWAIT direct IO when we find a compressed or an inline extent. Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/ Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582 Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-08 19:13:22 +02:00
Roman Gushchin	e33c267ab7	mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
Linus Torvalds	82708bb1eb	for-5.19-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930 XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2 RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq 4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+ QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo= =rUrf -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned relocation fixes: - fix critical section end for extent writeback, this could lead to out of order write - prevent writing to previous data relocation block group if space gets low - reflink fixes: - fix race between reflinking and ordered extent completion - proper error handling when block reserve migration fails - add missing inode iversion/mtime/ctime updates on each iteration when replacing extents - fix deadlock when running fsync/fiemap/commit at the same time - fix false-positive KCSAN report regarding pid tracking for read locks and data race - minor documentation update and link to new site * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Documentation: update btrfs list of features and link to readthedocs.io btrfs: fix deadlock with fsync+fiemap+transaction commit btrfs: don't set lock_owner when locking extent buffer for reading btrfs: zoned: fix critical section of relocation inode writeback btrfs: zoned: prevent allocation from previous data relocation BG btrfs: do not BUG_ON() on failure to migrate space when replacing extents btrfs: add missing inode updates on each iteration when replacing extents btrfs: fix race between reflinking and ordered extent completion	2022-06-26 10:11:36 -07:00
Linus Torvalds	ff872b76b3	for-5.19-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKxvkkACgkQxWXV+ddt WDsQYhAAofZGaOdBwSDvGA4srB2ieDIFoMeNb1NYp2P5vafPo3Q5AAvgGAeKhp5x g2C7W/8q2GMJ+B9SjyiBkVufuQmCWbFKxStQM3QysYoj/EyKyp7SXtO4YMWHz2T3 nfMMlPo2aNpr7Z2s+tcjhthq/hIvVFi6kweRFNvacM2bb/17IxgAdqLpQBqK5xe9 /IGSUTw75jSd2sZSyzBqrqshKDonmJ7u4qCV2X5hTPi8w4AUDERJrm0bOnikNXHx 4LnNDmSIA0BEXybHwEAShoK0ge66z1kP1UspQNB7pKriJcyroNPjgm/fMZJiRKIc zEYEMSzTYQa5eDwhXCz5PCaPqY4y/ovfYCsmySVXt1a7wgplVl+vsOaesE2NFVCX FE36d58L+4I8iTJhpVCNmEU9N/spfvAr3mBAcKCkbp9WKyGJ9/2yJpRThkV8Pw2Y bzhFIYRs1CJvkK7P4Cp+FSfzJx6tvYAqblvE97VUt83PuqS1Fb49lKdr5DZnbplV vDkewmvXSmHH9Ic5xBeTJXJZ+yeibk/0LSNEKczWva6f60h0ubF0OI6BzmS+NZbN HyitKerX0ZyFi5VUOZ+PKzXfR3ZlX3SmjAcHrl9BjZjFOJkpxAx6yWBzdnkitb+O fYyT68H4IetxwkghPVBv8qFCkuNy/i9NsEILcAAXd8CHGQlfwDA= =eORM -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - print more error messages for invalid mount option values - prevent remount with v1 space cache for subpage filesystem - fix hang during unmount when block group reclaim task is running * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add error messages to all unrecognized mount options btrfs: prevent remounting to v1 space cache for subpage mount btrfs: fix hang during unmount when block group reclaim task is running	2022-06-21 12:06:04 -05:00
Josef Bacik	bf7ba8ee75	btrfs: fix deadlock with fsync+fiemap+transaction commit We are hitting the following deadlock in production occasionally Task 1 Task 2 Task 3 Task 4 Task 5 fsync(A) start trans start commit falloc(A) lock 5m-10m start trans wait for commit fiemap(A) lock 0-10m wait for 5m-10m (have 0-5m locked) have btrfs_need_log_full_commit !full_sync wait_ordered_extents finish_ordered_io(A) lock 0-5m DEADLOCK We have an existing dependency of file extent lock -> transaction. However in fsync if we tried to do the fast logging, but then had to fall back to committing the transaction, we will be forced to call btrfs_wait_ordered_range() to make sure all of our extents are updated. This creates a dependency of transaction -> file extent lock, because btrfs_finish_ordered_io() will need to take the file extent lock in order to run the ordered extents. Fix this by stopping the transaction if we have to do the full commit and we attempted to do the fast logging. Then attach to the transaction and commit it if we need to. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:47:08 +02:00
Zygo Blaxell	97e86631bc	btrfs: don't set lock_owner when locking extent buffer for reading In `196d59ab9c` "btrfs: switch extent buffer tree lock to rw_semaphore" the functions for tree read locking were rewritten, and in the process the read lock functions started setting eb->lock_owner = current->pid. Previously lock_owner was only set in tree write lock functions. Read locks are shared, so they don't have exclusive ownership of the underlying object, so setting lock_owner to any single value for a read lock makes no sense. It's mostly harmless because write locks and read locks are mutually exclusive, and none of the existing code in btrfs (btrfs_init_new_buffer and print_eb_refs_lock) cares what nonsense is written in lock_owner when no writer is holding the lock. KCSAN does care, and will complain about the data race incessantly. Remove the assignments in the read lock functions because they're useless noise. Fixes: `196d59ab9c` ("btrfs: switch extent buffer tree lock to rw_semaphore") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:46:56 +02:00
Naohiro Aota	19ab78ca86	btrfs: zoned: fix critical section of relocation inode writeback We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to write out to the relocation inode. That critical section must include all the IO submission for the inode. However, flush_write_bio() in extent_writepages() is out of the critical section, causing an IO submission outside of the lock. This leads to an out of the order IO submission and fail the relocation process. Fix it by extending the critical section. Fixes: `35156d8527` ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:46:30 +02:00
Naohiro Aota	343d8a3085	btrfs: zoned: prevent allocation from previous data relocation BG After commit `5f0addf7b8` ("btrfs: zoned: use dedicated lock for data relocation"), we observe IO errors on e.g, btrfs/232 like below. [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs] <snip> [09.9][T4038707] Call Trace: [09.5][T4038707] <TASK> [09.3][T4038707] run_delalloc_nocow+0x7f1/0x11a0 [btrfs] [09.6][T4038707] ? test_range_bit+0x174/0x320 [btrfs] [09.2][T4038707] ? fallback_to_cow+0x980/0x980 [btrfs] [09.3][T4038707] ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs] [09.5][T4038707] btrfs_run_delalloc_range+0x445/0x1320 [btrfs] [09.2][T4038707] ? test_range_bit+0x320/0x320 [btrfs] [09.4][T4038707] ? lock_downgrade+0x6a0/0x6a0 [09.2][T4038707] ? orc_find.part.0+0x1ed/0x300 [09.5][T4038707] ? __module_address.part.0+0x25/0x300 [09.0][T4038707] writepage_delalloc+0x159/0x310 [btrfs] <snip> [09.4][ C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current] [09.9][ C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00 [09.4][ C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0 [09.9][ C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 The IO errors occur when we allocate a regular extent in previous data relocation block group. On zoned btrfs, we use a dedicated block group to relocate a data extent. Thus, we allocate relocating data extents (pre-alloc) only from the dedicated block group and vice versa. Once the free space in the dedicated block group gets tight, a relocating extent may not fit into the block group. In that case, we need to switch the dedicated block group to the next one. Then, the previous one is now freed up for allocating a regular extent. The BG is already not enough to allocate the relocating extent, but there is still room to allocate a smaller extent. Now the problem happens. By allocating a regular extent while nocow IOs for the relocation is still on-going, we will issue WRITE IOs (for relocation) and ZONE APPEND IOs (for the regular writes) at the same time. That mixed IOs confuses the write pointer and arises the unaligned write errors. This commit introduces a new bit 'zoned_data_reloc_ongoing' to the btrfs_block_group. We set this bit before releasing the dedicated block group, and no extent are allocated from a block group having this bit set. This bit is similar to setting block_group->ro, but is different from it by allowing nocow writes to start. Once all the nocow IO for relocation is done (hooked from btrfs_finish_ordered_io), we reset the bit to release the block group for further allocation. Fixes: `c2707a2556` ("btrfs: zoned: add a dedicated data relocation block group") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:48 +02:00
Filipe Manana	650c9caba3	btrfs: do not BUG_ON() on failure to migrate space when replacing extents At btrfs_replace_file_extents(), if we fail to migrate reserved metadata space from the transaction block reserve into the local block reserve, we trigger a BUG_ON(). This is because it should not be possible to have a failure here, as we reserved more space when we started the transaction than the space we want to migrate. However having a BUG_ON() is way too drastic, we can perfectly handle the failure and return the error to the caller. So just do that instead, and add a WARN_ON() to make it easier to notice the failure if it ever happens (which is particularly useful for fstests, and the warning will trigger a failure of a test case). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:27 +02:00
Filipe Manana	983d8209c6	btrfs: add missing inode updates on each iteration when replacing extents When replacing file extents, called during fallocate, hole punching, clone and deduplication, we may not be able to replace/drop all the target file extent items with a single transaction handle. We may get -ENOSPC while doing it, in which case we release the transaction handle, balance the dirty pages of the btree inode, flush delayed items and get a new transaction handle to operate on what's left of the target range. By dropping and replacing file extent items we have effectively modified the inode, so we should bump its iversion and update its mtime/ctime before we update the inode item. This is because if the transaction we used for partially modifying the inode gets committed by someone after we release it and before we finish the rest of the range, a power failure happens, then after mounting the filesystem our inode has an outdated iversion and mtime/ctime, corresponding to the values it had before we changed it. So add the missing iversion and mtime/ctime updates. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:21 +02:00
Filipe Manana	d4597898ba	btrfs: fix race between reflinking and ordered extent completion While doing a reflink operation, if an ordered extent for a file range that does not overlap with the source and destination ranges of the reflink operation happens, we can end up having a failure in the reflink operation and return -EINVAL to user space. The following sequence of steps explains how this can happen: 1) We have the page at file offset 315392 dirty (under delalloc); 2) A reflink operation for this file starts, using the same file as both source and destination, the source range is [372736, 409600) (length of 36864 bytes) and the destination range is [208896, 245760); 3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source and destination ranges, and wait for any ordered extents in those range to complete; 4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in the inode, but we neither wait for it to complete nor any ordered extents to complete. This results in starting delalloc for the page at file offset 315392 and creating an ordered extent for that single page range; 5) We then move to btrfs_clone() and enter the loop to find file extent items to copy from the source range to destination range; 6) In the first iteration we end up at last file extent item stored in leaf A: (...) item 131 key (143616 108 315392) itemoff 5101 itemsize 53 extent data disk bytenr 1903988736 nr 73728 extent data offset 12288 nr 61440 ram 73728 This represents the file range [315392, 376832), which overlaps with the source range to clone. @datal is set to 61440, key.offset is 315392 and @next_key_min_offset is therefore set to 376832 (315392 + 61440). @off (372736) is > key.offset (315392), so @new_key.offset is set to the value of @destoff (208896). @new_key.offset == @last_dest_end (208896) so @drop_start is set to 208896 (@new_key.offset). @datal is adjusted to 4096, as @off is > @key.offset. So in this iteration we call btrfs_replace_file_extents() for the range [208896, 212991] (a single page, which is [@drop_start, @new_key.offset + @datal - 1]). @last_dest_end is set to 212992 (@new_key.offset + @datal = 208896 + 4096 = 212992). Before the next iteration of the loop, @key.offset is set to the value 376832, which is @next_key_min_offset; 7) On the second iteration btrfs_search_slot() leaves us again at leaf A, but this time pointing beyond the last slot of leaf A, as that's where a key with offset 376832 should be at if it existed. So end up calling btrfs_next_leaf(); 8) btrfs_next_leaf() releases the path, but before it searches again the tree for the next key/leaf, the ordered extent for the single page range at file offset 315392 completes. That results in trimming the file extent item we processed before, adjusting its key offset from 315392 to 319488, reducing its length from 61440 to 57344 and inserting a new file extent item for that single page range, with a key offset of 315392 and a length of 4096. Leaf A now looks like: (...) item 132 key (143616 108 315392) itemoff 4995 itemsize 53 extent data disk bytenr 1801666560 nr 4096 extent data offset 0 nr 4096 ram 4096 item 133 key (143616 108 319488) itemoff 4942 itemsize 53 extent data disk bytenr 1903988736 nr 73728 extent data offset 16384 nr 57344 ram 73728 9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A at slot 133, since it's the first key that follows what was the last key we saw (143616 108 315392). In fact it's the same item we processed before, but its key offset was changed, so it counts as a new key; 10) So now we have: @key.offset == 319488 @datal == 57344 @off (372736) is > key.offset (319488), so @new_key.offset is set to 208896 (@destoff value). @new_key.offset (208896) != @last_dest_end (212992), so @drop_start is set to 212992 (@last_dest_end value). @datal is adjusted to 4096 because @off > @key.offset. So in this iteration we call btrfs_replace_file_extents() for the invalid range of [212992, 212991] (which is [@drop_start, @new_key.offset + @datal - 1]). This range is empty, the end offset is smaller than the start offset so btrfs_replace_file_extents() returns -EINVAL, which we end up returning to user space and fail the reflink operation. This all happens because the range of this file extent item was already processed in the previous iteration. This scenario can be triggered very sporadically by fsx from fstests, for example with test case generic/522. So fix this by having btrfs_clone() skip file extent items that cover a file range that we have already processed. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:13 +02:00
Al Viro	91b94c5d6a	iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC New helper to be used instead of direct checks for IOCB_DSYNC: iocb_is_dsync(iocb). Checks converted, which allows to avoid the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines) from iocb_flags() - it's checked in iocb_is_dsync() instead Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-06-10 16:05:15 -04:00
Al Viro	eacdf4eaca	btrfs: use IOMAP_DIO_NOSYNC ... instead of messing with iocb flags Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2022-06-10 16:04:13 -04:00
David Sterba	e3a4167c88	btrfs: add error messages to all unrecognized mount options Almost none of the errors stemming from a valid mount option but wrong value prints a descriptive message which would help to identify why mount failed. Like in the linked report: $ uname -r v4.19 $ mount -o compress=zstd /dev/sdb /mnt mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error. $ dmesg ... BTRFS error (device sdb): open_ctree failed Errors caused by memory allocation failures are left out as it's not a user error so reporting that would be confusing. Link: https://lore.kernel.org/linux-btrfs/9c3fec36-fc61-3a33-4977-a7e207c3fa4e@gmx.de/ CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-07 17:29:50 +02:00
Qu Wenruo	0591f04036	btrfs: prevent remounting to v1 space cache for subpage mount Upstream commit `9f73f1aef9` ("btrfs: force v2 space cache usage for subpage mount") forces subpage mount to use v2 cache, to avoid deprecated v1 cache which doesn't support subpage properly. But there is a loophole that user can still remount to v1 cache. The existing check will only give users a warning, but does not really prevent to do the remount. Although remounting to v1 will not cause any problems since the v1 cache will always be marked invalid when mounted with a different page size, it's still better to prevent v1 cache at all for subpage mounts. Fixes: `9f73f1aef9` ("btrfs: force v2 space cache usage for subpage mount") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-06 16:18:59 +02:00
Filipe Manana	31e70e5278	btrfs: fix hang during unmount when block group reclaim task is running When we start an unmount, at close_ctree(), if we have the reclaim task running and in the middle of a data block group relocation, we can trigger a deadlock when stopping an async reclaim task, producing a trace like the following: [629724.498185] task:kworker/u16:7 state:D stack: 0 pid:681170 ppid: 2 flags:0x00004000 [629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [629724.501267] Call Trace: [629724.501759] <TASK> [629724.502174] __schedule+0x3cb/0xed0 [629724.502842] schedule+0x4e/0xb0 [629724.503447] btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs] [629724.504534] ? prepare_to_wait_exclusive+0xc0/0xc0 [629724.505442] flush_space+0x423/0x630 [btrfs] [629724.506296] ? rcu_read_unlock_trace_special+0x20/0x50 [629724.507259] ? lock_release+0x220/0x4a0 [629724.507932] ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs] [629724.508940] ? do_raw_spin_unlock+0x4b/0xa0 [629724.509688] btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs] [629724.510922] process_one_work+0x252/0x5a0 [629724.511694] ? process_one_work+0x5a0/0x5a0 [629724.512508] worker_thread+0x52/0x3b0 [629724.513220] ? process_one_work+0x5a0/0x5a0 [629724.514021] kthread+0xf2/0x120 [629724.514627] ? kthread_complete_and_exit+0x20/0x20 [629724.515526] ret_from_fork+0x22/0x30 [629724.516236] </TASK> [629724.516694] task:umount state:D stack: 0 pid:719055 ppid:695412 flags:0x00004000 [629724.518269] Call Trace: [629724.518746] <TASK> [629724.519160] __schedule+0x3cb/0xed0 [629724.519835] schedule+0x4e/0xb0 [629724.520467] schedule_timeout+0xed/0x130 [629724.521221] ? lock_release+0x220/0x4a0 [629724.521946] ? lock_acquired+0x19c/0x420 [629724.522662] ? trace_hardirqs_on+0x1b/0xe0 [629724.523411] __wait_for_common+0xaf/0x1f0 [629724.524189] ? usleep_range_state+0xb0/0xb0 [629724.524997] __flush_work+0x26d/0x530 [629724.525698] ? flush_workqueue_prep_pwqs+0x140/0x140 [629724.526580] ? lock_acquire+0x1a0/0x310 [629724.527324] __cancel_work_timer+0x137/0x1c0 [629724.528190] close_ctree+0xfd/0x531 [btrfs] [629724.529000] ? evict_inodes+0x166/0x1c0 [629724.529510] generic_shutdown_super+0x74/0x120 [629724.530103] kill_anon_super+0x14/0x30 [629724.530611] btrfs_kill_super+0x12/0x20 [btrfs] [629724.531246] deactivate_locked_super+0x31/0xa0 [629724.531817] cleanup_mnt+0x147/0x1c0 [629724.532319] task_work_run+0x5c/0xa0 [629724.532984] exit_to_user_mode_prepare+0x1a6/0x1b0 [629724.533598] syscall_exit_to_user_mode+0x16/0x40 [629724.534200] do_syscall_64+0x48/0x90 [629724.534667] entry_SYSCALL_64_after_hwframe+0x44/0xae [629724.535318] RIP: 0033:0x7fa2b90437a7 [629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7 [629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0 [629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200 [629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000 [629724.541796] </TASK> This happens because: 1) Before entering close_ctree() we have the async block group reclaim task running and relocating a data block group; 2) There's an async metadata (or data) space reclaim task running; 3) We enter close_ctree() and park the cleaner kthread; 4) The async space reclaim task is at flush_space() and runs all the existing delayed iputs; 5) Before the async space reclaim task calls btrfs_wait_on_delayed_iputs(), the block group reclaim task which is doing the data block group relocation, creates a delayed iput at replace_file_extents() (called when COWing leaves that have file extent items pointing to relocated data extents, during the merging phase of relocation roots); 6) The async reclaim space reclaim task blocks at btrfs_wait_on_delayed_iputs(), since we have a new delayed iput; 7) The task at close_ctree() then calls cancel_work_sync() to stop the async space reclaim task, but it blocks since that task is waiting for the delayed iput to be run; 8) The delayed iput is never run because the cleaner kthread is parked, and no one else runs delayed iputs, resulting in a hang. So fix this by stopping the async block group reclaim task before we park the cleaner kthread. Fixes: `18bb8bbf13` ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-06 16:18:52 +02:00
Linus Torvalds	fdaf9a5840	Page cache changes for 5.19 - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A== =djLv -----END PGP SIGNATURE----- Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...	2022-05-24 19:55:07 -07:00
Linus Torvalds	bd1b7c1384	for-5.19-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKLxJAACgkQxWXV+ddt WDvC4BAAnSNwZ15FJKe5Y423f6PS6EXjyMuc5t/fW6UumTTbI+tsS+Glkis+JNBf BiDZSlVQmiK9WoQSJe04epZgHaK8MaCARyZaRaxjDC4Nvfq4DlD9mbAU9D6e7tZY Mo8M99D8wDW+SB+P8RBpNjwB/oGCMmE3nKC83g+1ObmA0FVRCyQ1Kazf8RzNT1rZ DiaJoKTvU1/wDN3/1rw5yG+EfW2m9A14gRCihslhFYaDV7jhpuabl8wLT7MftZtE MtJ6EOOQbgIDjnp5BEIrPmowW/N0tKDT/gorF7cWgLG2R1cbSlKgqSH1Sq7CjFUE AKj/DwfqZArPLpqMThWklCwy2B9qDEezrQSy7renP/vkeFLbOp8hQuIY5KRzohdG oDI8ThlQGtCVjbny6NX/BbCnWRAfTz0TquCgag3Xl8NbkRFgFJtkf/cSxzb+3LW1 tFeiUyTVLXVDS1cZLwgcb29Rrtp4bjd5/v3uECQlVD+or5pcAqSMkQgOBlyQJGbE Xb0nmPRihzQ8D4vINa63WwRyq0+QczVjvBxKj1daas0VEKGd32PIBS/0Qha+EpGl uFMiHBMSfqyl8QcShFk0cCbcgPMcNc7I6IAbXCE/WhhFG0ytqm9vpmlLqsTrXmHH z7/Eye/waqgACNEXoA8C4pyYzduQ4i1CeLDOdcsvBU6XQSuicSM= =lv6P -----END PGP SIGNATURE----- Merge tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - subpage: - support for PAGE_SIZE > 4K (previously only 64K) - make it work with raid56 - repair super block num_devices automatically if it does not match the number of device items - defrag can convert inline extents to regular extents, up to now inline files were skipped but the setting of mount option max_inline could affect the decision logic - zoned: - minimal accepted zone size is explicitly set to 4MiB - make zone reclaim less aggressive and don't reclaim if there are enough free zones - add per-profile sysfs tunable of the reclaim threshold - allow automatic block group reclaim for non-zoned filesystems, with sysfs tunables - tree-checker: new check, compare extent buffer owner against owner rootid Performance: - avoid blocking on space reservation when doing nowait direct io writes (+7% throughput for reads and writes) - NOCOW write throughput improvement due to refined locking (+3%) - send: reduce pressure to page cache by dropping extent pages right after they're processed Core: - convert all radix trees to xarray - add iterators for b-tree node items - support printk message index - user bulk page allocation for extent buffers - switch to bio_alloc API, use on-stack bios where convenient, other bio cleanups - use rw lock for block groups to favor concurrent reads - simplify workques, don't allocate high priority threads for all normal queues as we need only one - refactor scrub, process chunks based on their constraints and similarity - allocate direct io structures on stack and pass around only pointers, avoids allocation and reduces potential error handling Fixes: - fix count of reserved transaction items for various inode operations - fix deadlock between concurrent dio writes when low on free data space - fix a few cases when zones need to be finished VFS, iomap: - add helper to check if sb write has started (usable for assertions) - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io" * tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits) btrfs: zoned: introduce a minimal zone size 4M and reject mount btrfs: allow defrag to convert inline extents to regular extents btrfs: add "0x" prefix for unsupported optional features btrfs: do not account twice for inode ref when reserving metadata units btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer btrfs: send: avoid trashing the page cache btrfs: send: keep the current inode open while processing it btrfs: allocate the btrfs_dio_private as part of the iomap dio bio btrfs: move struct btrfs_dio_private to inode.c btrfs: remove the disk_bytenr in struct btrfs_dio_private btrfs: allocate dio_data on stack iomap: add per-iomap_iter private data iomap: allow the file system to provide a bio_set for direct I/O btrfs: add a btrfs_dio_rw wrapper btrfs: zoned: zone finish unused block group btrfs: zoned: properly finish block group on metadata write btrfs: zoned: finish block group when there are no more allocatable bytes left btrfs: zoned: consolidate zone finish functions btrfs: zoned: introduce btrfs_zoned_bg_is_full btrfs: improve error reporting in lookup_inline_extent_backref ...	2022-05-24 18:52:35 -07:00
Linus Torvalds	143a6252e1	arm64 updates for 5.19: - Initial support for the ARMv9 Scalable Matrix Extension (SME). SME takes the approach used for vectors in SVE and extends this to provide architectural support for matrix operations. No KVM support yet, SME is disabled in guests. - Support for crashkernel reservations above ZONE_DMA via the 'crashkernel=X,high' command line option. - btrfs search_ioctl() fix for live-lock with sub-page faults. - arm64 perf updates: support for the Hisilicon "CPA" PMU for monitoring coherent I/O traffic, support for Arm's CMN-650 and CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup. - Kselftest updates for SME, BTI, MTE. - Automatic generation of the system register macros from a 'sysreg' file describing the register bitfields. - Update the type of the function argument holding the ESR_ELx register value to unsigned long to match the architecture register size (originally 32-bit but extended since ARMv8.0). - stacktrace cleanups. - ftrace cleanups. - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(), avoid executable mappings in kexec/hibernate code, drop TLB flushing from get_clear_flush() (and rename it to get_clear_contig()), ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmKH19IACgkQa9axLQDI XvEFWg//bf0p6zjeNaOJmBbyVFsXsVyYiEaLUpFPUs3oB+81s2YZ+9i1rgMrNCft EIDQ9+/HgScKxJxnzWf68heMdcBDbk76VJtLALExbge6owFsjByQDyfb/b3v/bLd ezAcGzc6G5/FlI1IP7ct4Z9MnQry4v5AG8lMNAHjnf6GlBS/tYNAqpmj8HpQfgRQ ZbhfZ8Ayu3TRSLWL39NHVevpmxQm/bGcpP3Q9TtjUqg0r1FQ5sK/LCqOksueIAzT UOgUVYWSFwTpLEqbYitVqgERQp9LiLoK5RmNYCIEydfGM7+qmgoxofSq5e2hQtH2 SZM1XilzsZctRbBbhMit1qDBqMlr/XAy/R5FO0GauETVKTaBhgtj6mZGyeC9nU/+ RGDljaArbrOzRwMtSuXF+Fp6uVo5spyRn1m8UT/k19lUTdrV9z6EX5Fzuc4Mnhed oz4iokbl/n8pDObXKauQspPA46QpxUYhrAs10B/ELc3yyp/Qj3jOfzYHKDNFCUOq HC9mU+YiO9g2TbYgCrrFM6Dah2E8fU6/cR0ZPMeMgWK4tKa+6JMEINYEwak9e7M+ 8lZnvu3ntxiJLN+PrPkiPyG+XBh2sux1UfvNQ+nw4Oi9xaydeX7PCbQVWmzTFmHD q7UPQ8220e2JNCha9pULS8cxDLxiSksce06DQrGXwnHc1Ir7T04= =0DjE -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Initial support for the ARMv9 Scalable Matrix Extension (SME). SME takes the approach used for vectors in SVE and extends this to provide architectural support for matrix operations. No KVM support yet, SME is disabled in guests. - Support for crashkernel reservations above ZONE_DMA via the 'crashkernel=X,high' command line option. - btrfs search_ioctl() fix for live-lock with sub-page faults. - arm64 perf updates: support for the Hisilicon "CPA" PMU for monitoring coherent I/O traffic, support for Arm's CMN-650 and CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup. - Kselftest updates for SME, BTI, MTE. - Automatic generation of the system register macros from a 'sysreg' file describing the register bitfields. - Update the type of the function argument holding the ESR_ELx register value to unsigned long to match the architecture register size (originally 32-bit but extended since ARMv8.0). - stacktrace cleanups. - ftrace cleanups. - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(), avoid executable mappings in kexec/hibernate code, drop TLB flushing from get_clear_flush() (and rename it to get_clear_contig()), ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE. * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (145 commits) arm64/sysreg: Generate definitions for FAR_ELx arm64/sysreg: Generate definitions for DACR32_EL2 arm64/sysreg: Generate definitions for CSSELR_EL1 arm64/sysreg: Generate definitions for CPACR_ELx arm64/sysreg: Generate definitions for CONTEXTIDR_ELx arm64/sysreg: Generate definitions for CLIDR_EL1 arm64/sve: Move sve_free() into SVE code section arm64: Kconfig.platforms: Add comments arm64: Kconfig: Fix indentation and add comments arm64: mm: avoid writable executable mappings in kexec/hibernate code arm64: lds: move special code sections out of kernel exec segment arm64/hugetlb: Implement arm64 specific huge_ptep_get() arm64/hugetlb: Use ptep_get() to get the pte value of a huge page arm64: kdump: Do not allocate crash low memory if not needed arm64/sve: Generate ZCR definitions arm64/sme: Generate defintions for SVCR arm64/sme: Generate SMPRI_EL1 definitions arm64/sme: Automatically generate SMPRIMAP_EL2 definitions arm64/sme: Automatically generate SMIDR_EL1 defines arm64/sme: Automatically generate defines for SMCR ...	2022-05-23 21:06:11 -07:00
Linus Torvalds	115cd47132	for-5.19/block-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis iomvJ4yk0Q== =0Rw1 -----END PGP SIGNATURE----- Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...	2022-05-23 13:56:39 -07:00
Johannes Thumshirn	0a05fafe9d	btrfs: zoned: introduce a minimal zone size 4M and reject mount Zoned devices are expected to have zone sizes in the range of 1-2GB for ZNS SSDs and SMR HDDs have zone sizes of 256MB, so there is no need to allow arbitrarily small zone sizes on btrfs. But for testing purposes with emulated devices it is sometimes desirable to create devices with as small as 4MB zone size to uncover errors. So use 4MB as the smallest possible zone size and reject mounts of devices with a smaller zone size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-17 20:15:25 +02:00
Qu Wenruo	d8101a0c8a	btrfs: allow defrag to convert inline extents to regular extents Btrfs defaults to max_inline=2K to make small writes inlined into metadata. The default value is always a win, as even DUP/RAID1/RAID10 doubles the metadata usage, it should still cause less physical space used compared to a 4K regular extents. But since the introduction of RAID1C3 and RAID1C4 it's no longer the case, users may find inlined extents causing too much space wasted, and want to convert those inlined extents back to regular extents. Unfortunately defrag will unconditionally skip all inline extents, no matter if the user is trying to converting them back to regular extents. So this patch will add a small exception for defrag_collect_targets() to allow defragging inline extents, if and only if the inlined extents are larger than max_inline, allowing users to convert them to regular ones. This also allows us to defrag extents like the following: item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69 generation 7 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 1 (zlib) item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53 generation 7 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 16384 ram 16384 extent compression 1 (zlib) Previously we're unable to do any defrag, since the first extent is inlined, and the second one has no extent to merge. Now we can defrag it to just one single extent, saving 48 bytes metadata space. item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13635584 nr 4096 extent data offset 0 nr 20480 ram 20480 extent compression 1 (zlib) Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-17 20:15:25 +02:00
Qu Wenruo	d5321a0fa8	btrfs: add "0x" prefix for unsupported optional features The following error message lack the "0x" obviously: cannot mount because of unsupported optional features (4000) Add the prefix to make it less confusing. This can happen on older kernels that try to mount a filesystem with newer features so it makes sense to backport to older trees. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-17 20:15:25 +02:00
Filipe Manana	97bdf1a903	btrfs: do not account twice for inode ref when reserving metadata units When reserving metadata units for creating an inode, we don't need to reserve one extra unit for the inode ref item because when creating the inode, at btrfs_create_new_inode(), we always insert the inode item and the inode ref item in a single batch (a single btree insert operation, and both ending up in the same leaf). As we have accounted already one unit for the inode item, the extra unit for the inode ref item is superfluous, it only makes us reserve more metadata than necessary and often adding more reclaim pressure if we are low on available metadata space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-17 20:15:25 +02:00
Naohiro Aota	aa9ffadfca	btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer The block_group->alloc_offset is an offset from the start of the block group. OTOH, the ->meta_write_pointer is an address in the logical space. So, we should compare the alloc_offset shifted with the block_group->start. Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-17 20:15:25 +02:00
Filipe Manana	152555b39c	btrfs: send: avoid trashing the page cache A send operation reads extent data using the buffered IO path for getting extent data to send in write commands and this is both because it's simple and to make use of the generic readahead infrastructure, which results in a massive speedup. However this fills the page cache with data that, most of the time, is really only used by the send operation - once the write commands are sent, it's not useful to have the data in the page cache anymore. For large snapshots, bringing all data into the page cache eventually leads to the need to evict other data from the page cache that may be more useful for applications (and kernel subsystems). Even if extents are shared with the subvolume on which a snapshot is based on and the data is currently on the page cache due to being read through the subvolume, attempting to read the data through the snapshot will always result in bringing a new copy of the data into another location in the page cache (there's currently no shared memory for shared extents). So make send evict the data it has read before if when it first opened the inode, its mapping had no pages currently loaded: when inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding based on the return value of filemap_range_has_page() before reading an extent because the generic readahead mechanism may read pages beyond the range we request (and it very often does it), which means a call to filemap_range_has_page() will return true due to the readahead that was triggered when processing a previous extent - we don't have a simple way to distinguish this case from the case where the data was brought into the page cache through someone else. So checking for the mapping number of pages being 0 when we first open the inode is simple, cheap and it generally accomplishes the goal of not trashing the page cache - the only exception is if part of data was previously loaded into the page cache through the snapshot by some other process, in that case we end up not evicting any data send brings into the page cache, just like before this change - but that however is not the common case. Example scenario, on a box with 32G of RAM: $ btrfs subvolume create /mnt/sv1 $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1 $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1 $ free -m total used free shared buff/cache available Mem: 31937 186 26866 0 4883 31297 Swap: 8188 0 8188 # After this we get less 4G of free memory. $ btrfs send /mnt/snap1 >/dev/null $ free -m total used free shared buff/cache available Mem: 31937 186 22814 0 8935 31297 Swap: 8188 0 8188 The same, obviously, applies to an incremental send. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-17 20:14:54 +02:00
Filipe Manana	521b6803f2	btrfs: send: keep the current inode open while processing it Every time we send a write command, we open the inode, read some data to a buffer and then close the inode. The amount of data we read for each write command is at most 48K, returned by max_send_read_size(), and that corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does not add any significant overhead, because the time elapsed between every close (iput()) and open (btrfs_iget()) is very short, so the inode is kept in the VFS's cache after the iput() and it's still there by the time we do the next btrfs_iget(). As between processing extents of the current inode we don't do anything else, it makes sense to keep the inode open after we process its first extent that needs to be sent and keep it open until we start processing the next inode. This serves to facilitate the next change, which aims to avoid having send operations trash the page cache with data extents. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:33 +02:00
Christoph Hellwig	642c5d34da	btrfs: allocate the btrfs_dio_private as part of the iomap dio bio Create a new bio_set that contains all the per-bio private data needed by btrfs for direct I/O and tell the iomap code to use that instead of separately allocation the btrfs_dio_private structure. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:33 +02:00
Christoph Hellwig	a3e171a09c	btrfs: move struct btrfs_dio_private to inode.c The btrfs_dio_private structure is only used in inode.c, so move the definition there. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Christoph Hellwig	acb8b52a15	btrfs: remove the disk_bytenr in struct btrfs_dio_private This field is never used, so remove it. Last use was probably in `23ea8e5a07` ("Btrfs: load checksum data once when submitting a direct read io"). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Christoph Hellwig	491a6d0118	btrfs: allocate dio_data on stack Make use of the new iomap_iter->private field to avoid a memory allocation per iomap range. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Christoph Hellwig	786f847f43	iomap: add per-iomap_iter private data Allow the file system to keep state for all iterations. For now only wire it up for direct I/O as there is an immediate need for it there. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Christoph Hellwig	36e8c62273	btrfs: add a btrfs_dio_rw wrapper Add a wrapper around iomap_dio_rw that keeps the direct I/O internals isolated in inode.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Naohiro Aota	74e91b12b1	btrfs: zoned: zone finish unused block group While the active zones within an active block group are reset, and their active resource is released, the block group itself is kept in the active block group list and marked as active. As a result, the list will contain more than max_active_zones block groups. That itself is not fatal for the device as the zones are properly reset. However, that inflated list is, of course, strange. Also, a to-appear patch series, which deactivates an active block group on demand, gets confused with the wrong list. So, fix the issue by finishing the unused block group once it gets read-only, so that we can release the active resource in an early stage. Fixes: `be1a1d7a5d` ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Naohiro Aota	56fbb0a4e8	btrfs: zoned: properly finish block group on metadata write Commit `be1a1d7a5d` ("btrfs: zoned: finish fully written block group") introduced zone finishing code both for data and metadata end_io path. However, the metadata side is not working as it should. First, it compares logical address (eb->start + eb->len) with offset within a block group (cache->zone_capacity) in submit_eb_page(). That essentially disabled zone finishing on metadata end_io path. Furthermore, fixing the issue above revealed we cannot call btrfs_zone_finish_endio() in end_extent_buffer_writeback(). We cannot call btrfs_lookup_block_group() which require spin lock inside end_io context. Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer writeback and do the zone finish IO in a workqueue. Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used. Fixes: `be1a1d7a5d` ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Naohiro Aota	8b8a53998c	btrfs: zoned: finish block group when there are no more allocatable bytes left Currently, btrfs_zone_finish_endio() finishes a block group only when the written region reaches the end of the block group. We can also finish the block group when no more allocation is possible. Fixes: `be1a1d7a5d` ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Naohiro Aota	d70cbdda75	btrfs: zoned: consolidate zone finish functions btrfs_zone_finish() and btrfs_zone_finish_endio() have similar code. Introduce do_zone_finish() to factor out the common code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Naohiro Aota	1bfd476754	btrfs: zoned: introduce btrfs_zoned_bg_is_full Introduce a wrapper to check if all the space in a block group is allocated or not. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
Nikolay Borisov	cf4f03c3be	btrfs: improve error reporting in lookup_inline_extent_backref When iterating the backrefs in an extent item if the ptr to the 'current' backref record goes beyond the extent item a warning is generated and -ENOENT is returned. However what's more appropriate to debug such cases would be to return EUCLEAN and also print identifying information about the performed search as well as the current content of the leaf containing the possibly corrupted extent item. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
David Sterba	0f07003b0f	btrfs: rename bio_ctrl::bio_flags to compress_type The bio_ctrl is the last use of bio_flags that has been converted to compress type everywhere else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:32 +02:00
David Sterba	cb3a12d988	btrfs: rename bio_flags in parameters and switch type Several functions take parameter bio_flags that was simplified to just compress type, unify it and change the type accordingly. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	0ff400135b	btrfs: rename io_failure_record::bio_flags to compress_type The bio_flags is now used to store unchanged compress type, so unify that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	7f6ca7f21d	btrfs: open code extent_set_compress_type helpers The helpers extent_set_compress_type and extent_compress_type have become trivial after previous cleanups and can be removed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	2a5232a8ce	btrfs: simplify handling of bio_ctrl::bio_flags The bio_flags are used only to encode the compression and there are no other EXTENT_BIO_* flags, so the compress type can be stored directly. The struct member name is left unchanged and will be cleaned in later patches. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	572f3dad52	btrfs: remove trivial helper update_nr_written The helper used to do more with the wbc state but now it's just one subtraction, no need to have a special helper. It became trivial in `a91326679f` ("Btrfs: make mapping->writeback_index point to the last written page"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	a6f5e39ee7	btrfs: remove unused parameter bio_flags from btrfs_wq_submit_bio Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	0e3696f80f	btrfs: remove btrfs_delayed_extent_op::is_data The value of btrfs_delayed_extent_op::is_data is always false, we can cascade the change and simplify code that depends on it, removing the structure member eventually. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
David Sterba	2fe6a5a1d2	btrfs: sink parameter is_data to btrfs_set_disk_extent_flags The parameter has been added in 2009 in the infamous monster commit `5d4f98a28c` ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") but not used ever since. We can sink it and allow further simplifications. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
Filipe Manana	f5585f4f0e	btrfs: fix deadlock between concurrent dio writes when low on free data space When reserving data space for a direct IO write we can end up deadlocking if we have multiple tasks attempting a write to the same file range, there are multiple extents covered by that file range, we are low on available space for data and the writes don't expand the inode's i_size. The deadlock can happen like this: 1) We have a file with an i_size of 1M, at offset 0 it has an extent with a size of 128K and at offset 128K it has another extent also with a size of 128K; 2) Task A does a direct IO write against file range [0, 256K), and because the write is within the i_size boundary, it takes the inode's lock (VFS level) in shared mode; 3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and then gets the extent map for the extent covering the range [0, 128K). At btrfs_get_blocks_direct_write(), it creates an ordered extent for that file range ([0, 128K)); 4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file range [0, 256K); 5) Task A executes btrfs_dio_iomap_begin() again, this time for the file range [128K, 256K), and locks the file range [128K, 256K); 6) Task B starts a direct IO write against file range [0, 256K) as well. It also locks the inode in shared mode, as it's within the i_size limit, and then tries to lock file range [0, 256K). It is able to lock the subrange [0, 128K) but then blocks waiting for the range [128K, 256K), as it is currently locked by task A; 7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data space. Because we are low on available free space, it triggers the async data reclaim task, and waits for it to reserve data space; 8) The async reclaim task decides to wait for all existing ordered extents to complete (through btrfs_wait_ordered_roots()). It finds the ordered extent previously created by task A for the file range [0, 128K) and waits for it to complete; 9) The ordered extent for the file range [0, 128K) can not complete because it blocks at btrfs_finish_ordered_io() when trying to lock the file range [0, 128K). This results in a deadlock, because: - task B is holding the file range [0, 128K) locked, waiting for the range [128K, 256K) to be unlocked by task A; - task A is holding the file range [128K, 256K) locked and it's waiting for the async data reclaim task to satisfy its space reservation request; - the async data reclaim task is waiting for ordered extent [0, 128K) to complete, but the ordered extent can not complete because the file range [0, 128K) is currently locked by task B, which is waiting on task A to unlock file range [128K, 256K) and task A waiting on the async data reclaim task. This results in a deadlock between 4 task: task A, task B, the async data reclaim task and the task doing ordered extent completion (a work queue task). This type of deadlock can sporadically be triggered by the test case generic/300 from fstests, and results in a stack trace like the following: [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds. [12084.034877] Not tainted 5.18.0-rc2-btrfs-next-115 #1 [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [12084.036548] task:kworker/u16:7 state:D stack: 0 pid:123749 ppid: 2 flags:0x00004000 [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [12084.036599] Call Trace: [12084.036601] <TASK> [12084.036606] __schedule+0x3cb/0xed0 [12084.036616] schedule+0x4e/0xb0 [12084.036620] btrfs_start_ordered_extent+0x109/0x1c0 [btrfs] [12084.036651] ? prepare_to_wait_exclusive+0xc0/0xc0 [12084.036659] btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs] [12084.036688] btrfs_work_helper+0xf8/0x400 [btrfs] [12084.036719] ? lock_is_held_type+0xe8/0x140 [12084.036727] process_one_work+0x252/0x5a0 [12084.036736] ? process_one_work+0x5a0/0x5a0 [12084.036738] worker_thread+0x52/0x3b0 [12084.036743] ? process_one_work+0x5a0/0x5a0 [12084.036745] kthread+0xf2/0x120 [12084.036747] ? kthread_complete_and_exit+0x20/0x20 [12084.036751] ret_from_fork+0x22/0x30 [12084.036765] </TASK> [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds. [12084.037702] Not tainted 5.18.0-rc2-btrfs-next-115 #1 [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [12084.039506] task:kworker/u16:11 state:D stack: 0 pid:153787 ppid: 2 flags:0x00004000 [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] [12084.039551] Call Trace: [12084.039553] <TASK> [12084.039557] __schedule+0x3cb/0xed0 [12084.039566] schedule+0x4e/0xb0 [12084.039569] schedule_timeout+0xed/0x130 [12084.039573] ? mark_held_locks+0x50/0x80 [12084.039578] ? _raw_spin_unlock_irq+0x24/0x50 [12084.039580] ? lockdep_hardirqs_on+0x7d/0x100 [12084.039585] __wait_for_common+0xaf/0x1f0 [12084.039587] ? usleep_range_state+0xb0/0xb0 [12084.039596] btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs] [12084.039636] btrfs_wait_ordered_roots+0x175/0x240 [btrfs] [12084.039670] flush_space+0x25b/0x630 [btrfs] [12084.039712] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs] [12084.039747] process_one_work+0x252/0x5a0 [12084.039756] ? process_one_work+0x5a0/0x5a0 [12084.039758] worker_thread+0x52/0x3b0 [12084.039762] ? process_one_work+0x5a0/0x5a0 [12084.039765] kthread+0xf2/0x120 [12084.039766] ? kthread_complete_and_exit+0x20/0x20 [12084.039770] ret_from_fork+0x22/0x30 [12084.039783] </TASK> [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds. [12084.040709] Not tainted 5.18.0-rc2-btrfs-next-115 #1 [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [12084.042404] task:kworker/u16:17 state:D stack: 0 pid:217907 ppid: 2 flags:0x00004000 [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] [12084.042461] Call Trace: [12084.042463] <TASK> [12084.042471] __schedule+0x3cb/0xed0 [12084.042485] schedule+0x4e/0xb0 [12084.042490] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs] [12084.042539] ? prepare_to_wait_exclusive+0xc0/0xc0 [12084.042551] lock_extent_bits+0x37/0x90 [btrfs] [12084.042601] btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs] [12084.042656] ? lock_is_held_type+0xe8/0x140 [12084.042667] btrfs_work_helper+0xf8/0x400 [btrfs] [12084.042716] ? lock_is_held_type+0xe8/0x140 [12084.042727] process_one_work+0x252/0x5a0 [12084.042742] worker_thread+0x52/0x3b0 [12084.042750] ? process_one_work+0x5a0/0x5a0 [12084.042754] kthread+0xf2/0x120 [12084.042757] ? kthread_complete_and_exit+0x20/0x20 [12084.042763] ret_from_fork+0x22/0x30 [12084.042783] </TASK> [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds. [12084.043598] Not tainted 5.18.0-rc2-btrfs-next-115 #1 [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [12084.045244] task:fio state:D stack: 0 pid:234517 ppid:234515 flags:0x00004000 [12084.045248] Call Trace: [12084.045250] <TASK> [12084.045254] __schedule+0x3cb/0xed0 [12084.045263] schedule+0x4e/0xb0 [12084.045266] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs] [12084.045298] ? prepare_to_wait_exclusive+0xc0/0xc0 [12084.045306] lock_extent_bits+0x37/0x90 [btrfs] [12084.045336] btrfs_dio_iomap_begin+0x336/0xc60 [btrfs] [12084.045370] ? lock_is_held_type+0xe8/0x140 [12084.045378] iomap_iter+0x184/0x4c0 [12084.045383] __iomap_dio_rw+0x2c6/0x8a0 [12084.045406] iomap_dio_rw+0xa/0x30 [12084.045408] btrfs_do_write_iter+0x370/0x5e0 [btrfs] [12084.045440] aio_write+0xfa/0x2c0 [12084.045448] ? __might_fault+0x2a/0x70 [12084.045451] ? kvm_sched_clock_read+0x14/0x40 [12084.045455] ? lock_release+0x153/0x4a0 [12084.045463] io_submit_one+0x615/0x9f0 [12084.045467] ? __might_fault+0x2a/0x70 [12084.045469] ? kvm_sched_clock_read+0x14/0x40 [12084.045478] __x64_sys_io_submit+0x83/0x160 [12084.045483] ? syscall_enter_from_user_mode+0x1d/0x50 [12084.045489] do_syscall_64+0x3b/0x90 [12084.045517] entry_SYSCALL_64_after_hwframe+0x44/0xae [12084.045521] RIP: 0033:0x7fa76511af79 [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1 [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79 [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000 [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330 [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0 [12084.045561] </TASK> Fix this issue by always reserving data space before locking a file range at btrfs_dio_iomap_begin(). If we can't reserve the space, then we don't error out immediately - instead after locking the file range, check if we can do a NOCOW write, and if we can we don't error out since we don't need to allocate a data extent, however if we can't NOCOW then error out with -ENOSPC. This also implies that we may end up reserving space when it's not needed because the write will end up being done in NOCOW mode - in that case we just release the space after we noticed we did a NOCOW write - this is the same type of logic that is done in the path for buffered IO writes. Fixes: `f0bfa76a11` ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range") CC: stable@vger.kernel.org # 5.17+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
Goldwyn Rodrigues	1d8fa2e29b	btrfs: derive compression type from extent map during reads Derive the compression type from extent map as opposed to the bio flags passed. This makes it more precise and not reliant on function parameters. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
Qu Wenruo	a13467ee7a	btrfs: scrub: move scrub_remap_extent() call into scrub_extent() [SUSPICIOUS CODE] When refactoring scrub code, I noticed a very strange behavior around scrub_remap_extent(): if (sctx->is_dev_replace) scrub_remap_extent(fs_info, cur_logical, scrub_len, &cur_physical, &target_dev, &cur_mirror); As replace target is a 1:1 copy of the source device, thus physical offset inside the target should be the same as physical inside source, thus this remap call makes no sense to me. [REAL FUNCTIONALITY] After more investigation, the function name scrub_remap_extent() doesn't tell anything of the truth, nor does its if () condition. The real story behind this function is that, for scrub_pages() we never expect missing device, even for replacing missing device. What scrub_remap_extent() is really doing is to find a live mirror, and make later scrub_pages() to read data from the good copy, other than from the missing device and increase error counters unnecessarily. [IMPROVEMENT] We have no need to bother scrub_remap_extent() in scrub_simple_mirror() at all, we only need to call it before we call scrub_pages(). And rename the function to scrub_find_live_copy(), add extra comments on them. By this we can remove one parameter from scrub_extent(), and reduce the unnecessary calls to scrub_remap_extent() for regular replace. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
Qu Wenruo	d483bfd27a	btrfs: scrub: use find_first_extent_item to for extent item search Since we have find_first_extent_item() to iterate the extent items of a certain range, there is no need to use the open-coded version. Replace the final scrub call site with find_first_extent_item(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
Qu Wenruo	9ae53bf909	btrfs: scrub: refactor scrub_raid56_parity() Currently scrub_raid56_parity() has a large double loop, handling the following things at the same time: - Iterate each data stripe - Iterate each extent item in one data stripe Refactor this by: - Introduce a new helper to handle data stripe iteration The new helper is scrub_raid56_data_stripe_for_parity(), which only has one while() loop handling the extent items inside the data stripe. The code is still mostly the same as the old code. - Call cond_resched() for each extent Previously we only call cond_resched() under a complex if () check. I see no special reason to do that, and for other scrub functions, like scrub_simple_mirror() we're already doing the same cond_resched() after scrubbing one extent. - Add more comments Please note that, this patch is only to address the double loop, there are incoming patches to do extra cleanup. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:31 +02:00
Qu Wenruo	18d30ab961	btrfs: scrub: use scrub_simple_mirror() to handle RAID56 data stripe scrub Although RAID56 has complex repair mechanism, which involves reading the whole full stripe, but inside one data stripe, it's in fact no different than SINGLE/RAID1. The point here is, for data stripe we just check the csum for each extent we hit. Only for csum mismatch case, our repair paths divide. So we can still reuse scrub_simple_mirror() for RAID56 data stripes, which saves quite some code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:30 +02:00
Qu Wenruo	e430c4287e	btrfs: scrub: cleanup the non-RAID56 branches in scrub_stripe() Since we have moved all other profiles handling into their own functions, now the main body of scrub_stripe() is just handling RAID56 profiles. There is no need to address other profiles in the main loop of scrub_stripe(), so we can remove those dead branches. Since we're here, also slightly change the timing of initialization of variables like @offset, @increment and @logical. Especially for @logical, we don't really need to initialize it for btrfs_extent_root()/btrfs_csum_root(), we can use bg->start for that purpose. Now those variables are only initialize for RAID56 branches. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:30 +02:00
Qu Wenruo	8557635ed2	btrfs: scrub: introduce dedicated helper to scrub simple-stripe based range The new entrance will iterate through each data stripe which belongs to the target device. And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:30 +02:00
Qu Wenruo	09022b14fa	btrfs: scrub: introduce dedicated helper to scrub simple-mirror based range The new helper, scrub_simple_mirror(), will scrub all extents inside a range which only has simple mirror based duplication. This covers every range of SINGLE/DUP/RAID1/RAID1C, and inside each data stripe for RAID0/RAID10. Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C profiles. As one can see, the new entrance for those simple-mirror based profiles can be small enough (with comments, just reach 100 lines). This function will be the basis for the incoming scrub refactor. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:30 +02:00
Qu Wenruo	416bd7e7af	btrfs: scrub: introduce a helper to locate an extent item The new helper, find_first_extent_item(), will locate an extent item (either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the search range. This helper will later be used to refactor scrub code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:30 +02:00
Qu Wenruo	1194a82481	btrfs: calculate physical_end using dev_extent_len directly in scrub_stripe() The variable @physical_end is the exclusive stripe end, currently it's calculated using @physical + @dev_extent_len / map->stripe_len * map->stripe_len. And since at allocation time we ensured dev_extent_len is stripe_len aligned, the result is the same as @physical + @dev_extent_len. So this patch will just assign @physical and @physical_end early, without using @nstripes. This is especially helpful for any possible out: label user, as now we only need to initialize @offset before going to out: label. Since we're here, also make @physical_end constant. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:17:30 +02:00
Gabriel Niebler	48b36a602a	btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray … rename it to simply fs_roots and adjust all usages of this object to use the XArray API, because it is notionally easier to use and understand, as it provides array semantics, and also takes care of locking for us, further simplifying the code. Also do some refactoring, esp. where the API change requires largely rewriting some functions, anyway. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:15:57 +02:00
Gabriel Niebler	8ee922689d	btrfs: turn fs_info member buffer_radix into XArray … named 'extent_buffers'. Also adjust all usages of this object to use the XArray API, which greatly simplifies the code as it takes care of locking and is generally easier to use and understand, providing notionally simpler array semantics. Also perform some light refactoring. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:16 +02:00
Gabriel Niebler	4076942021	btrfs: turn name_cache radix tree into XArray in send_ctx … and adjust all usages of this object to use the XArray API for the sake of consistency. XArray API provides array semantics, so it is notionally easier to use and understand, and it also takes care of locking for us. None of this makes a real difference in this particular patch, but it does in other places where similar replacements are or have been made and we want to be consistent in our usage of data structures in btrfs. Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:16 +02:00
Gabriel Niebler	253bf57555	btrfs: turn delayed_nodes_tree into an XArray … in the btrfs_root struct and adjust all usages of this object to use the XArray API, because it is notionally easier to use and understand, as it provides array semantics, and also takes care of locking for us, further simplifying the code. Also use the opportunity to do some light refactoring. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:16 +02:00
Qu Wenruo	719fae8920	btrfs: use ilog2() to replace if () branches for btrfs_bg_flags_to_raid_index() In function btrfs_bg_flags_to_raid_index(), we use quite some if () to convert the BTRFS_BLOCK_GROUP_* bits to a index number. But the truth is, there is really no such need for so many branches at all. Since all BTRFS_BLOCK_GROUP_* flags are just one single bit set inside BTRFS_BLOCK_GROUP_PROFILES_MASK, we can easily use ilog2() to calculate their values. This calculation has an anchor point, the lowest PROFILE bit, which is RAID0. Even it's fixed on-disk format and should never change, here I added extra compile time checks to make it super safe: 1. Make sure RAID0 is always the lowest bit in PROFILE_MASK This is done by finding the first (least significant) bit set of RAID0 and PROFILE_MASK & ~RAID0. 2. Make sure RAID0 bit set beyond the highest bit of TYPE_MASK Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:16 +02:00
Qu Wenruo	f04fbcc64e	btrfs: move definition of btrfs_raid_types to volumes.h It's only internally used as another way to represent btrfs profiles, it's not exposed through any on-disk format, in fact this btrfs_raid_types is diverted from the on-disk format values. Furthermore, since it's internal structure, its definition can change in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:16 +02:00
Christoph Hellwig	385de0ef38	btrfs: use a normal workqueue for rmw_workers rmw_workers doesn't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:16 +02:00
Christoph Hellwig	be53951826	btrfs: use normal workqueues for scrub All three scrub workqueues don't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Christoph Hellwig	a31b4a4368	btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue Just let the one caller that wants optional WQ_HIGHPRI handling allocate a separate btrfs_workqueue for that. This allows to rename struct __btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and separate allocation for all btrfs_workqueue users and generally simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	a7b8e39c92	btrfs: raid56: enable subpage support for RAID56 Now the btrfs RAID56 infrastructure has migrated to use sector_ptr interface, it should be safe to enable subpage support for RAID56. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	3907ce293d	btrfs: raid56: make alloc_rbio_essential_pages() subpage compatible The non-compatible part is only the bitmap iteration part, now the bitmap size is extended to rbio::stripe_nsectors, not the old rbio::stripe_npages. Since we're here, also slightly improve the function by: - Rename @i to @stripe - Rename @bit to @sectornr - Move @page and @index into the inner loop Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	d4e28d9b5f	btrfs: raid56: make steal_rbio() subpage compatible Function steal_rbio() will take all the uptodate pages from the source rbio to destination rbio. With the new stripe_sectors[] array, we also need to do the extra check: - Check sector::flags to make sure the full page is uptodate Now we don't use PageUptodate flag for subpage cases to indicate if the page is uptodate. Instead we need to check all the sectors belong to the page to be sure about whether it's full page uptodate. So here we introduce a new helper, full_page_sectors_uptodate() to do the check. - Update rbio::stripe_sectors[] to use the new page pointer We only need to change the page pointer, no need to change anything else. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	5fdb7afc6f	btrfs: raid56: make set_bio_pages_uptodate() subpage compatible Unlike previous code, we can not directly set PageUptodate for stripe pages now. Instead we have to iterate through all the sectors and set SECTOR_UPTODATE flag there. Introduce a new helper find_stripe_sector(), to do the work. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	ac26df8b3b	btrfs: raid56: remove btrfs_raid_bio::bio_pages array The functionality is completely replaced by the new bio_sectors member, now it's time to remove the old member. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	6346f6bf16	btrfs: raid56: make raid56_add_scrub_pages() subpage compatible This requires one extra parameter @pgoff for the function. In the current code base, scrub is still one page per sector, thus the new parameter will always be 0. It needs the extra subpage scrub optimization code to fully take advantage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	f77183dc1f	btrfs: raid56: open code rbio_stripe_page_index() There is only one caller for that helper now, and we're definitely fine to open-code it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	1145059ae5	btrfs: raid56: make finish_rmw() subpage compatible With this function converted to subpage compatible sector interfaces, the following helper functions can be removed: - rbio_stripe_page() - rbio_pstripe_page() - rbio_qstripe_page() - page_in_rbio() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	07e4d38080	btrfs: raid56: make __raid_recover_endio_io() subpage compatible This involves: - Use sector_ptr interface to grab the pointers - Add sector->pgoff to pointers[] - Rebuild data using sectorsize instead of PAGE_SIZE - Use memcpy() to replace copy_page() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	46900662d0	btrfs: raid56: make finish_parity_scrub() subpage compatible The core is to convert direct page usage into sector_ptr usage, and use memcpy() to replace copy_page(). For pointers usage, we need to convert it to kmap_local_page() + sector->pgoff. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	3e77605d6a	btrfs: raid56: make rbio_add_io_page() subpage compatible Make rbio_add_io_page() subpage compatible, which involves: - Rename rbio_add_io_page() to rbio_add_io_sector() Although we still rely on PAGE_SIZE == sectorsize, so add a new ASSERT() inside rbio_add_io_sector() to make sure all pgoff is 0. - Introduce rbio_stripe_sector() helper The equivalent of rbio_stripe_page(). This new helper has extra ASSERT()s to validate the stripe and sector number. - Introduce sector_in_rbio() helper The equivalent of page_in_rbio(). - Rename @pagenr variables to @sectornr - Use rbio::stripe_nsectors when iterating the bitmap Please note that, this only changes the interface, the bios are still using full page for IO. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:15 +02:00
Qu Wenruo	00425dd976	btrfs: raid56: introduce btrfs_raid_bio::bio_sectors This new member is going to fully replace bio_pages in the future, but for now let's keep them co-exist, until the full switch is done. Currently cache_rbio_pages() and index_rbio_pages() will also populate the new array. And cache_rbio_pages() need to record which sectors are uptodate, so we also need to introduce sector_ptr::uptodate bit. To avoid extra memory usage, we let the new @uptodate bit to share bits with @pgoff. Now pgoff only has at most 31 bits, which is already more than enough, as even for 256K page size, we only need 18 bits. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Qu Wenruo	eb3570607c	btrfs: raid56: introduce btrfs_raid_bio::stripe_sectors The new member is an array of sector_ptr pointers, they will represent all sectors inside a full stripe (including P/Q). They co-operate with btrfs_raid_bio::stripe_pages: stripe_pages: \| Page 0, range [0, 64K) \| Page 1 ... stripe_sectors: \| \| \| ... \| \| \| \| \- sector 15, page 0, pgoff=60K \| \- sector 1, page 0, pgoff=4K \---- sector 0, page 0, pfoff=0 With such structure, we can represent subpage sectors without using extra pages. Here we introduce a new helper, index_stripe_sectors(), to update stripe_sectors[] to point to correct page and pgoff. So every time rbio::stripe_pages[] pointer gets updated, the new helper should be called. The following functions have to call the new helper: - steal_rbio() - alloc_rbio_pages() - alloc_rbio_parity_pages() - alloc_rbio_essential_pages() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Qu Wenruo	94efbe19b9	btrfs: raid56: introduce new cached members for btrfs_raid_bio The new members are all related to number of sectors, but the existing number of pages members are kept as is: - nr_sectors Total sectors of the full stripe including P/Q. - stripe_nsectors The sectors of a single stripe. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Qu Wenruo	29b068382c	btrfs: raid56: make btrfs_raid_bio more compact There are a lot of members using much larger type in btrfs_raid_bio than necessary, like nr_pages which represents the total number of a full stripe. Instead of int (which is at least 32bits), u16 is already enough (max stripe length will be 256MiB, already beyond current RAID56 device number limit). So this patch will reduce the width of the following members: - stripe_len to u32 - nr_pages to u16 - nr_data to u8 - real_stripes to u8 - scrubp to u8 - faila/b to s8 As -1 is used to indicate no corruption This will slightly reduce the size of btrfs_raid_bio from 272 bytes to 256 bytes, reducing 16 bytes usage. But please note that, when using btrfs_raid_bio, we allocate extra space for it to cover various pointer array, so the reduce memory is not really a big saving overall. As we're here modifying the comments already, update existing comments to current code standard. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Qu Wenruo	843de58b3e	btrfs: raid56: open code rbio_nr_pages() The function rbio_nr_pages() is only called once inside alloc_rbio(), there is no reason to make it dedicated helper. Furthermore, the return type doesn't match, the function return "unsigned long" which may not be necessary, while the only caller only uses "int". Since we're doing cleaning up here, also fix the type to "const unsigned int" for all involved local variables. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Qu Wenruo	cc353a8be2	btrfs: reduce width for stripe_len from u64 to u32 Currently btrfs uses fixed stripe length (64K), thus u32 is wide enough for the usage. Furthermore, even in the future we choose to enlarge stripe length to larger values, I don't believe we would want stripe as large as 4G or larger. So this patch will reduce the width for all in-memory structures and parameters, this involves: - RAID56 related function argument lists This allows us to do direct division related to stripe_len. Although we will use bits shift to replace the division anyway. - btrfs_io_geometry structure This involves one change to simplify the calculation of both @stripe_nr and @stripe_offset, using div64_u64_rem(). And add extra sanity check to make sure @stripe_offset is always small enough for u32. This saves 8 bytes for the structure. - map_lookup structure This convert @stripe_len to u32, which saves 8 bytes. (saved 4 bytes, and removed a 4-bytes hole) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Christoph Hellwig	ad357938c6	btrfs: do not return errors from submit_bio_hook_t instances Both btrfs_repair_one_sector and submit_bio_one as the direct caller of one of the instances ignore errors as they expect the methods themselves to call ->bi_end_io on error. Remove the unused and dangerous return value. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Christoph Hellwig	cb4411dd57	btrfs: do not return errors from btrfs_submit_compressed_read btrfs_submit_compressed_read already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Christoph Hellwig	94d9e11b27	btrfs: do not return errors from btrfs_submit_metadata_bio btrfs_submit_metadata_bio already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Christoph Hellwig	abf48d5871	btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bio This argument is unused since commit `953651eb30` ("btrfs: factor out helper adding a page to bio") and commit `1b36294a6c` ("btrfs: call submit_bio_hook directly for metadata pages") reworked the way metadata bio submission is handled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Christoph Hellwig	7aab8b3282	btrfs: move btrfs_readpage to extent_io.c Keep btrfs_readpage next to btrfs_do_readpage and the other address space operations. This allows to keep submit_one_bio and struct btrfs_bio_ctrl file local in extent_io.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Qu Wenruo	d201238ccd	btrfs: repair super block num_devices automatically [BUG] There is a report that a btrfs has a bad super block num devices. This makes btrfs to reject the fs completely. BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here BTRFS error (device sdd3): failed to read chunk tree: -22 BTRFS error (device sdd3): open_ctree failed [CAUSE] During btrfs device removal, chunk tree and super block num devs are updated in two different transactions: btrfs_rm_device() \|- btrfs_rm_dev_item(device) \| \|- trans = btrfs_start_transaction() \| \| Now we got transaction X \| \| \| \|- btrfs_del_item() \| \| Now device item is removed from chunk tree \| \| \| \|- btrfs_commit_transaction() \| Transaction X got committed, super num devs untouched, \| but device item removed from chunk tree. \| (AKA, super num devs is already incorrect) \| \|- cur_devices->num_devices--; \|- cur_devices->total_devices--; \|- btrfs_set_super_num_devices() All those operations are not in transaction X, thus it will only be written back to disk in next transaction. So after the transaction X in btrfs_rm_dev_item() committed, but before transaction X+1 (which can be minutes away), a power loss happen, then we got the super num mismatch. This has been fixed by commit `bbac58698a` ("btrfs: remove device item and update super block in the same transaction"). [FIX] Make the super_num_devices check less strict, converting it from a hard error to a warning, and reset the value to a correct one for the current or next transaction commit. As the number of device items is the critical information where the super block num_devices is only a cached value (and also useful for cross checking), it's safe to automatically update it. Other device related problems like missing device are handled after that and may require other means to resolve, like degraded mount. With this fix, potentially affected filesystems won't fail mount and require the manual repair by btrfs check. Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:14 +02:00
Goldwyn Rodrigues	46fbd18e78	btrfs: do not pass compressed_bio to submit_compressed_bio() Parameter struct compressed_bio is not used by the function submit_compressed_bio(). Remove it. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Filipe Manana	2306e83e73	btrfs: avoid double search for block group during NOCOW writes When doing a NOCOW write, either through direct IO or buffered IO, we do two lookups for the block group that contains the target extent: once when we call btrfs_inc_nocow_writers() and then later again when we call btrfs_dec_nocow_writers() after creating the ordered extent. The lookups require taking a lock and navigating the red black tree used to track all block groups, which can take a non-negligible amount of time for a large filesystem with thousands of block groups, as well as lock contention and cache line bouncing. Improve on this by having a single block group search: making btrfs_inc_nocow_writers() return the block group to its caller and then have the caller pass that block group to btrfs_dec_nocow_writers(). This is part of a patchset comprised of the following patches: btrfs: remove search start argument from first_logical_byte() btrfs: use rbtree with leftmost node cached for tracking lowest block group btrfs: use a read/write lock for protecting the block groups tree btrfs: return block group directly at btrfs_next_block_group() btrfs: avoid double search for block group during NOCOW writes The following test was used to test these changes from a performance perspective: $ cat test.sh #!/bin/bash modprobe null_blk nr_devices=0 NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0 mkdir $NULL_DEV_PATH if [ $? -ne 0 ]; then echo "Failed to create nullb0 directory." exit 1 fi echo 2 > $NULL_DEV_PATH/submit_queues echo 16384 > $NULL_DEV_PATH/size # 16G echo 1 > $NULL_DEV_PATH/memory_backed echo 1 > $NULL_DEV_PATH/power DEV=/dev/nullb0 MNT=/mnt/nullb0 LOOP_MNT="$MNT/loop" MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [io_uring_writes] rw=randwrite fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bs=64k filesize=1g runtime=300 time_based directory=$LOOP_MNT numjobs=8 thread EOF echo performance \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT mkdir $LOOP_MNT truncate -s 4T $MNT/loopfile mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT # Trigger the allocation of about 3500 data block groups, without # actually consuming space on underlying filesystem, just to make # the tree of block group large. fallocate -l 3500G $LOOP_MNT/filler fio /tmp/fio-job.ini umount $LOOP_MNT umount $MNT echo 0 > $NULL_DEV_PATH/power rmdir $NULL_DEV_PATH The test was run on a non-debug kernel (Debian's default kernel config), the result were the following. Before patchset: WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec After patchset: WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec +3.3% write throughput and +3.3% IO done in the same time period. The test has somewhat limited coverage scope, as with only NOCOW writes we get less contention on the red black tree of block groups, since we don't have the extra contention caused by COW writes, namely when allocating data extents, pinning and unpinning data extents, but on the hand there's access to tree in the NOCOW path, when incrementing a block group's number of NOCOW writers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Filipe Manana	8b01f931c1	btrfs: return block group directly at btrfs_next_block_group() At btrfs_next_block_group(), we have this long line with two statements: cache = btrfs_lookup_first_block_group(...); return cache; This makes it a bit harder to read due to two statements on the same line, so change that to directly return the result of the call to btrfs_lookup_first_block_group(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Filipe Manana	16b0c2581e	btrfs: use a read/write lock for protecting the block groups tree Currently we use a spin lock to protect the red black tree that we use to track block groups. Most accesses to that tree are actually read only and for large filesystems, with thousands of block groups, it actually has a bad impact on performance, as concurrent read only searches on the tree are serialized. Read only searches on the tree are very frequent and done when: 1) Pinning and unpinning extents, as we need to lookup the respective block group from the tree; 2) Freeing the last reference of a tree block, regardless if we pin the underlying extent or add it back to free space cache/tree; 3) During NOCOW writes, both buffered IO and direct IO, we need to check if the block group that contains an extent is read only or not and to increment the number of NOCOW writers in the block group. For those operations we need to search for the block group in the tree. Similarly, after creating the ordered extent for the NOCOW write, we need to decrement the number of NOCOW writers from the same block group, which requires searching for it in the tree; 4) Decreasing the number of extent reservations in a block group; 5) When allocating extents and freeing reserved extents; 6) Adding and removing free space to the free space tree; 7) When releasing delalloc bytes during ordered extent completion; 8) When relocating a block group; 9) During fitrim, to iterate over the block groups; 10) etc; Write accesses to the tree, to add or remove block groups, are much less frequent as they happen only when allocating a new block group or when deleting a block group. We also use the same spin lock to protect the list of currently caching block groups. Additions to this list are made when we need to cache a block group, because we don't have a free space cache for it (or we have but it's invalid), and removals from this list are done when caching of the block group's free space finishes. These cases are also not very common, but when they happen, they happen only once when the filesystem is mounted. So switch the lock that protects the tree of block groups from a spinning lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Filipe Manana	08dddb2951	btrfs: use rbtree with leftmost node cached for tracking lowest block group We keep track of the start offset of the block group with the lowest start offset at fs_info->first_logical_byte. This requires explicitly updating that field every time we add, delete or lookup a block group to/from the red black tree at fs_info->block_group_cache_tree. Since the block group with the lowest start address happens to always be the one that is the leftmost node of the tree, we can use a red black tree that caches the left most node. Then when we need the start address of that block group, we can just quickly get the leftmost node in the tree and extract the start offset of that node's block group. This avoids the need to explicitly keep track of that address in the dedicated member fs_info->first_logical_byte, and it also allows the next patch in the series to switch the lock that protects the red black tree from a spin lock to a read/write lock - without this change it would be tricky because block group searches also update fs_info->first_logical_byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Filipe Manana	0eb997bff0	btrfs: remove search start argument from first_logical_byte() The search start argument passed to first_logical_byte() is always 0, as we always want to get the logical start address of the block group with the lowest logical start address. So remove it, as not only it is not necessary, it also makes the following patches that change the lock that protects the red black tree of block groups from a spin lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Qu Wenruo	44e5801fad	btrfs: return correct error number for __extent_writepage_io() [BUG] If we hit an error from submit_extent_page() inside __extent_writepage_io(), we could still return 0 to the caller, and even trigger the warning in btrfs_page_assert_not_dirty(). [CAUSE] In __extent_writepage_io(), if we hit an error from submit_extent_page(), we will just clean up the range and continue. This is completely fine for regular PAGE_SIZE == sectorsize, as we can only hit one sector in one page, thus after the error we're ensured to exit and @ret will be saved. But for subpage case, we may have other dirty subpage range in the page, and in the next loop, we may succeeded submitting the next range. In that case, @ret will be overwritten, and we return 0 to the caller, while we have hit some error. [FIX] Introduce @has_error and @saved_ret to record the first error we hit, so we will never forget what error we hit. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Qu Wenruo	10f7f6f879	btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage() [BUG] Test case generic/475 have a very high chance (almost 100%) to hit a fs hang, where a data page will never be unlocked and hang all later operations. [CAUSE] In btrfs_do_readpage(), if we hit an error from submit_extent_page() we will try to do the cleanup for our current io range, and exit. This works fine for PAGE_SIZE == sectorsize cases, but not for subpage. For subpage btrfs_do_readpage() will lock the full page first, which can contain several different sectors and extents: btrfs_do_readpage() \|- begin_page_read() \| \|- btrfs_subpage_start_reader(); \| Now the page will have PAGE_SIZE / sectorsize reader pending, \| and the page is locked. \| \|- end_page_read() for different branches \| This function will reduce subpage readers, and when readers \| reach 0, it will unlock the page. But when submit_extent_page() failed, we only cleanup the current io range, while the remaining io range will never be cleaned up, and the page remains locked forever. [FIX] Update the error handling of submit_extent_page() to cleanup all the remaining subpage range before exiting the loop. Please note that, now submit_extent_page() can only fail due to sanity check in alloc_new_bio(). Thus regular IO errors are impossible to trigger the error path. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Qu Wenruo	c9583ada8c	btrfs: avoid double clean up when submit_one_bio() failed [BUG] When running generic/475 with 64K page size and 4K sector size, it has a very high chance (almost 100%) to hang, with mostly data page locked but no one is going to unlock it. [CAUSE] With commit `1784b7d502` ("btrfs: handle csum lookup errors properly on reads"), if we failed to lookup checksum due to metadata IO error, we will return error for btrfs_submit_data_bio(). This will cause the page to be unlocked twice in btrfs_do_readpage(): btrfs_do_readpage() \|- submit_extent_page() \| \|- submit_one_bio() \| \|- btrfs_submit_data_bio() \| \|- if (ret) { \| \|- bio->bi_status = ret; \| \|- bio_endio(bio); } \| In the endio function, we will call end_page_read() \| and unlock_extent() to cleanup the subpage range. \| \|- if (ret) { \|- unlock_extent(); end_page_read() } Here we unlock the extent and cleanup the subpage range again. For unlock_extent(), it's mostly double unlock safe. But for end_page_read(), it's not, especially for subpage case, as for subpage case we will call btrfs_subpage_end_reader() to reduce the reader number, and use that to number to determine if we need to unlock the full page. If double accounted, it can underflow the number and leave the page locked without anyone to unlock it. [FIX] The commit `1784b7d502` ("btrfs: handle csum lookup errors properly on reads") itself is completely fine, it's our existing code not properly handling the error from bio submission hook properly. This patch will make submit_one_bio() to return void so that the callers will never be able to do cleanup when bio submission hook fails. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Schspa Shi	dd7382a2a7	btrfs: use non-bh spin_lock in zstd timer callback This is an optimization for fix `fee13fe965` ("btrfs: correct zstd workspace manager lock to use spin_lock_bh()") The critical region for wsm.lock is only accessed by the process context and the softirq context. Because in the soft interrupt, the critical section will not be preempted by the soft interrupt again, there is no need to call spin_lock_bh(&wsm.lock) to turn off the soft interrupt, spin_lock(&wsm.lock) is enough for this situation. Signed-off-by: Schspa Shi <schspa@gmail.com> [ minor comment update ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Filipe Manana	490243884e	btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode() We are still using the magic value of 2 at btrfs_create_new_inode(), but there's now a constant for that, named BTRFS_DIR_START_INDEX, which was introduced in commit `528ee69712` ("btrfs: put initial index value of a directory in a constant"). So change that to use the constant. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Qu Wenruo	c0111c4417	btrfs: simplify parameters of submit_read_repair() and rename Cleanup the function submit_read_repair() by: - Remove the fixed argument submit_bio_hook() The function is only called on buffered data read path, so the @submit_bio_hook argument is always btrfs_submit_data_bio(). Since it's fixed, then there is no need to pass that argument at all. - Rename the function to submit_data_read_repair() Just to be more explicit on all the 3 things, data, read and repair. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:13 +02:00
Christoph Hellwig	8e010b3d70	btrfs: remove the zoned/zone_size union in struct btrfs_fs_info Reading a value from a different member of a union is not just a great way to obfuscate code, but also creates an aliasing violation. Switch btrfs_is_zoned to look at ->zone_size and remove the union. Note: union was to simplify the detection of zoned filesystem but now this is wrapped behind btrfs_is_zoned so we can drop the union. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Lv Ruyi	8aa1e49ea1	btrfs: remove unnecessary check of iput argument iput() already handles NULL and non-NULL parameter, so it is not needed to check that. This unifies all iput calls. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	b027669449	btrfs: stop using the btrfs_bio saved iter in index_rbio_pages The bios added to ->bio_list are the original bios fed into btrfs_map_bio, which are never advanced. Just use the iter in the bio itself. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	75c17e6666	btrfs: don't allocate a btrfs_bio for scrub bios All the scrub bios go straight to the block device or the raid56 code, none of which looks at the btrfs_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	e1b4b44e00	btrfs: don't allocate a btrfs_bio for raid56 per-stripe bios Except for the spurious initialization of ->device just after allocation nothing uses the btrfs_bio, so just allocate a normal bio without extra data. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	e01bf588f8	btrfs: pass bio opf to rbio_add_io_page Prepare for further refactoring by moving this initialization to a single place instead of setting it in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	110ac0e543	btrfs: pass a block_device to btrfs_bio_clone Pass the block_device to bio_alloc_clone instead of setting it later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	fce3f24ada	btrfs: move the call to bio_set_dev out of submit_stripe_bio Prepare for additional refactoring, btrfs_map_bio is direct caller of submit_stripe_bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	f77dcc0d64	btrfs: use on-stack bio in scrub_repair_page_from_good_copy The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	f3b8a7f3fb	btrfs: use on-stack bio in scrub_recheck_block The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	e9458bfe5f	btrfs: use on-stack bio in repair_io_failure The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Also cleanup the error handling to use goto labels and not discard the actual return values. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	91e3b5f1e2	btrfs: check-integrity: simplify bio allocation in btrfsic_read_block btrfsic_read_block does not need the btrfs_bio structure, so switch to plain bio_alloc (that also does not fail as it's backed by a bioset). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	58ff51f148	btrfs: check-integrity: split submit_bio from btrfsic checking Require a separate call to the integrity checking helpers from the actual bio submission. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:12 +02:00
Christoph Hellwig	57906d58e2	btrfs: factor check and flush helpers from __btrfsic_submit_bio Split out two helpers to make __btrfsic_submit_bio more readable. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Johannes Thumshirn	3687fcb075	btrfs: zoned: make auto-reclaim less aggressive The current auto-reclaim algorithm starts reclaiming all block groups with a zone_unusable value above a configured threshold. This is causing a lot of reclaim IO even if there would be enough free zones on the device. Instead of only accounting a block groups zone_unusable value, also take the ratio of free and not usable (written as well as zone_unusable) bytes a device has into account. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Josef Bacik	ef972e7b5e	btrfs: change the bg_reclaim_threshold valid region from 0 to 100 For the non-zoned case we may want to set the threshold for reclaim to something below 50%. Change the acceptable threshold from 50-100 to 0-100. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Josef Bacik	ac2f1e63c6	btrfs: allow block group background reclaim for non-zoned filesystems This will allow us to set a threshold for block groups to be automatically relocated even if we don't have zoned devices. We have found this feature invaluable at Facebook due to how our workload interacts with the allocator. We have been using this in production for months with only a single problem that has already been fixed. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Josef Bacik	bb5a098d97	btrfs: make the bg_reclaim_threshold per-space info For non-zoned file systems it's useful to have the auto reclaim feature, however there are different use cases for non-zoned, for example we may not want to reclaim metadata chunks ever, only data chunks. Move this sysfs flag to per-space_info. This won't affect current users because this tunable only ever did anything for zoned, and that is currently hidden behind BTRFS_CONFIG_DEBUG. Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ jth restore global bg_reclaim_threshold ] Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Filipe Manana	a7bb6bd4bd	btrfs: do not test for free space inode during NOCOW check against file extent When checking if we can do a NOCOW write against a range covered by a file extent item, we do a quick a check to determine if the inode's root was snapshotted in a generation older than the generation of the file extent item or not. This is to quickly determine if the extent is likely shared and avoid the expensive check for cross references (this was added in commit `78d4295b1e` ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path"). We restrict that check to the case where the inode is not a free space inode (since commit `27a7ff554e` ("btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow")). That is because when we had the inode cache feature, inode caches were backed by a free space inode that belonged to the inode's root. However we don't have support for the inode cache feature since kernel 5.11, so we don't need this check anymore since free space inodes are now always related to free space caches, which are always associated to the root tree (which can't be snapshotted, and its last_snapshot field is always 0). So remove that condition. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Filipe Manana	619104ba45	btrfs: move common NOCOW checks against a file extent into a helper Verifying if we can do a NOCOW write against a range fully or partially covered by a file extent item requires verifying several constraints, and these are currently duplicated at two different places: can_nocow_extent() and run_delalloc_nocow(). This change moves those checks into a common helper function to avoid duplication. It adds some comments and also preserves all existing behaviour like for example can_nocow_extent() treating errors from the calls to btrfs_cross_ref_exist() and csum_exist_in_range() as meaning we can not NOCOW, instead of propagating the error back to the caller. That specific behaviour is questionable but also reasonable to some degree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy	395cb57e85	btrfs: wait between incomplete batch memory allocations When allocating memory in a loop, each iteration should call memalloc_retry_wait() in order to prevent starving memory-freeing processes (and to mark where allocation loops are). Other filesystems do that as well. The bulk page allocation is the only place in btrfs with an allocation retry loop, so add an appropriate call to it. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy	91d6ac1d62	btrfs: allocate page arrays using bulk page allocator While calling alloc_page() in a loop is an effective way to populate an array of pages, the MM subsystem provides a method to allocate pages in bulk. alloc_pages_bulk_array() populates the NULL slots in a page array, trying to grab more than one page at a time. Unfortunately, it doesn't guarantee allocating all slots in the array, but it's easy to call it in a loop and return an error if no progress occurs. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy	dd137dd1f2	btrfs: factor out allocating an array of pages Several functions currently populate an array of page pointers one allocated page at a time. Factor out the common code so as to allow improvements to all of the sites at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Yu Zhe	0d031dc4aa	btrfs: remove unnecessary type casts Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Qu Wenruo	1a42daab11	btrfs: expand subpage support to any PAGE_SIZE > 4K With the recent change in metadata handling, we can handle metadata in the following cases: - nodesize < PAGE_SIZE and sectorsize < PAGE_SIZE Go subpage routine for both metadata and data. - nodesize < PAGE_SIZE and sectorsize >= PAGE_SIZE Invalid case for now. As we require nodesize >= sectorsize. - nodesize >= PAGE_SIZE and sectorsize < PAGE_SIZE Go subpage routine for data, but regular page routine for metadata. - nodesize >= PAGE_SIZE and sectorsize >= PAGE_SIZE Go regular page routine for both metadata and data. Now we can handle any sectorsize < PAGE_SIZE, plus the existing sectorsize == PAGE_SIZE support. But here we introduce an artificial limit, any PAGE_SIZE > 4K case, we will only support 4K and PAGE_SIZE as sector size. The idea here is to reduce the test combinations, and push 4K as the default standard in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Qu Wenruo	fbca46eb46	btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine The reason why we only support 64K page size for subpage is, for 64K page size we can ensure no matter what the nodesize is, we can fit it into one page. When other page size come, especially like 16K, the limitation is a bit limiting. To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the non-subpage routine. By this, we can allow 4K sectorsize on 16K page size. Although this introduces another smaller limitation, the metadata can not cross page boundary, which is already met by most recent mkfs. Another small improvement is, we can avoid the overhead for metadata if nodesize >= PAGE_SIZE. For 4K sector size and 64K page size/node size, or 4K sector size and 16K page size/node size, we don't need to allocate extra memory for the metadata pages. Please note that, this patch will not yet enable other page size support yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:11 +02:00
Qu Wenruo	e959d3c1df	btrfs: use dummy extent buffer for super block sys chunk array read In function btrfs_read_sys_array(), we allocate a real extent buffer using btrfs_find_create_tree_block(). Such extent buffer will be even cached into buffer_radix tree, and using btree inode address space. However we only use such extent buffer to enable the accessors, thus we don't even need to bother using real extent buffer, a dummy one is what we really need. And for dummy extent buffer, we no longer need to do any special handling for the first page, as subpage helper is already doing it properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Naohiro Aota	0320b3538b	btrfs: assert that relocation is protected with sb_start_write() Relocation of a data block group creates ordered extents. They can cause a hang when a process is trying to thaw the filesystem. We should have called sb_start_write(), so the filesystem is not being frozen. Add an ASSERT to check it is protected. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Nikolay Borisov	d864546231	btrfs: simplify code flow in btrfs_ioctl_balance Move code in btrfs_ioctl_balance to simplify its flow. This is possible thanks to the removal of balance v1 ioctl and ensuring 'arg' argument is always present. First move the code duplicating the userspace arg to the kernel 'barg'. This makes the out_unlock label redundant. Secondly, check the validity of bargs::flags before copying to the dynamically allocated 'bctl'. This removes the need for the out_bctl label. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Nikolay Borisov	398646011e	btrfs: remove checks for arg argument in btrfs_ioctl_balance With the removal of balance v1 ioctl the 'arg' argument is guaranteed to be present so simply remove all conditional code which checks for its presence. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Qu Wenruo	b06660b595	btrfs: replace memset with memzero_page in data checksum verification The original code resets the page to 0x1 for not apparent reason, it's been like that since the initial 2007 code added in commit `07157aacb1` ("Btrfs: Add file data csums back in via hooks in the extent map code"). It could mean that a failed buffer can be detected from the data but that's just a guess and any value is good. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	d4135134ab	btrfs: avoid blocking on space revervation when doing nowait dio writes When doing a NOWAIT direct IO write, if we can NOCOW then it means we can proceed with the non-blocking, NOWAIT path. However reserving the metadata space and qgroup meta space can often result in blocking - flushing delalloc, wait for ordered extents to complete, trigger transaction commits, etc, going against the semantics of a NOWAIT write. So make the NOWAIT write path to try to reserve all the metadata it needs without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT then return -EAGAIN to make the caller fallback to a blocking direct IO write. This is part of a patchset comprised of the following patches: btrfs: avoid blocking on page locks with nowait dio on compressed range btrfs: avoid blocking nowait dio when locking file range btrfs: avoid double nocow check when doing nowait dio writes btrfs: stop allocating a path when checking if cross reference exists btrfs: free path at can_nocow_extent() before checking for checksum items btrfs: release path earlier at can_nocow_extent() btrfs: avoid blocking when allocating context for nowait dio read/write btrfs: avoid blocking on space revervation when doing nowait dio writes The following test was run before and after applying this patchset: $ cat io-uring-nodatacow-test.sh #!/bin/bash DEV=/dev/sdc MNT=/mnt/sdc MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=4 FILE_SIZE=8G RUN_TIME=300 cat <<EOF > /tmp/fio-job.ini [io_uring_rw] rw=randrw fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 filesize=$FILE_SIZE runtime=$RUN_TIME time_based filename=foobar directory=$MNT numjobs=$NUM_JOBS thread EOF echo performance \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run a 12 cores box with 64G of ram, using a non-debug kernel config (Debian's default config) and a spinning disk. Result before the patchset: READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec Result after the patchset: READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec That's about +7.2% throughput for reads and +6.9% for writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	4f208dcc6b	btrfs: avoid blocking when allocating context for nowait dio read/write When doing a NOWAIT direct IO read/write, we allocate a context object (struct btrfs_dio_data) with GFP_NOFS, which can result in blocking waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM \| __GFP_IO). This is undesirable for the NOWAIT semantics, so do the allocation with GFP_NOWAIT if we are serving a NOWAIT request and if the allocation fails return -EAGAIN, so that the caller can fallback to a blocking context and retry with a non-blocking write. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	59d35c5171	btrfs: release path earlier at can_nocow_extent() At can_nocow_extent(), we are releasing the path only after checking if the block group that has the target extent is read only, and after checking if there's delalloc in the range in case our extent is a preallocated extent. The read only extent check can be expensive if we have a very large filesystem with many block groups, as well as the check for delalloc in the inode's io_tree in case the io_tree is big due to IO on other file ranges. Our path is holding a read lock on a leaf and there's no need to keep the lock while doing those two checks, so release the path before doing them, immediately after the last use of the leaf. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	c1a548db25	btrfs: free path at can_nocow_extent() before checking for checksum items When we look for checksum items, through csum_exist_in_range(), at can_nocow_extent(), we no longer need the path that we have previously allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(), we also end up allocating a path, so we are adding unnecessary extra memory usage. So free the path before calling csum_exist_in_range(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	1a89f17386	btrfs: stop allocating a path when checking if cross reference exists At btrfs_cross_ref_exist() we always allocate a path, but we really don't need to because all its callers (only 2) already have an allocated path that is not being used when they call btrfs_cross_ref_exist(). So change btrfs_cross_ref_exist() to take a path as an argument and update both its callers to pass in the unused path they have when they call btrfs_cross_ref_exist(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	d7a8ab4e9b	btrfs: avoid double nocow check when doing nowait dio writes When doing a NOWAIT direct IO write we are checking twice if we can COW into the target file range using can_nocow_extent() - once at the very beginning of the write path, at btrfs_write_check() via check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write(). The can_nocow_extent() function does a lot of expensive things - searching for the file extent item in the inode's subvolume tree, searching for the extent item in the extent tree, checking delayed references, etc, so it isn't a very cheap call. We can remove the first check at btrfs_write_check(), and add there a quick check to verify if the inode has the NODATACOW or PREALLOC flags, and quickly bail out if it doesn't have neither of those flags, as that means we have to COW and therefore can't comply with the NOWAIT semantics. After this we do only one call to can_nocow_extent(), while we are at btrfs_get_blocks_direct_write(), where we have already locked the file range and we did a try lock on the range before, at btrfs_dio_iomap_begin() (since the previous patch in the series). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:10 +02:00
Filipe Manana	5909440344	btrfs: avoid blocking nowait dio when locking file range If we are doing a NOWAIT direct IO read/write, we can block when locking the file range at btrfs_dio_iomap_begin(), as it's possible the range (or a part of it) is already locked by another task (mmap writes, another direct IO read/write racing with us, fiemap, etc). We are also waiting for completion of any ordered extent we find in the range, which also can block us for a significant amount of time. There's also the incorrect fallback to buffered IO (returning -ENOTBLK) when we are dealing with a NOWAIT request and we can't proceed. In this case we should be returning -EAGAIN, as falling back to buffered IO can result in blocking for many different reasons, so that the caller can delegate a retry to a context where blocking is more acceptable. Fix these cases by: 1) Doing a try lock on the file range and failing with -EAGAIN if we can not lock right away; 2) Fail with -EAGAIN if we find an ordered extent; 3) Return -EAGAIN instead of -ENOTBLK when we need to fallback to buffered IO and we have a NOWAIT request. This will also allow us to avoid a duplicated check that verifies if we are able to do a NOCOW write for NOWAIT direct IO writes, done in the next patch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	b023e67512	btrfs: avoid blocking on page locks with nowait dio on compressed range If we are doing NOWAIT direct IO read/write and our inode has compressed extents, we call filemap_fdatawrite_range() against the range in order to wait for compressed writeback to complete, since the generic code at iomap_dio_rw() calls filemap_write_and_wait_range() once, which is not enough to wait for compressed writeback to complete. This call to filemap_fdatawrite_range() can block on page locks, since the first writepages() on a range that we will try to compress results only in queuing a work to compress the data while holding the pages locked. Even though the generic code at iomap_dio_rw() will do the right thing and return -EAGAIN for NOWAIT requests in case there are pages in the range, we can still end up at btrfs_dio_iomap_begin() with pages in the range because either of the following can happen: 1) Memory mapped writes, as we haven't locked the range yet; 2) Buffered reads might have started, which lock the pages, and we do the filemap_fdatawrite_range() call before locking the file range. So don't call filemap_fdatawrite_range() at btrfs_dio_iomap_begin() if we are doing a NOWAIT read/write. Instead call filemap_range_needs_writeback() to check if there are any locked, dirty, or under writeback pages, and return -EAGAIN if that's the case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Jonathan Lassoff	b0a66a3137	btrfs: add messages to printk index In order for end users to quickly react to new issues that come up in production, it is proving useful to leverage this printk indexing system. This printk index enables kernel developers to use calls to printk() with changeable ad-hoc format strings, while still enabling end users to detect changes and develop a semi-stable interface for detecting and parsing these messages. So that detailed Btrfs messages are captured by this printk index, this patch wraps btrfs_printk and btrfs_handle_fs_error with macros. Example of the generated list: https://lore.kernel.org/lkml/12588e13d51a9c3bf59467d3fc1ac2162f1275c1.1647539056.git.jof@thejof.com Signed-off-by: Jonathan Lassoff <jof@thejof.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Qu Wenruo	88c602ab44	btrfs: tree-checker: check extent buffer owner against owner rootid Btrfs doesn't check whether the tree block respects the root owner. This means, if a tree block referred by a parent in extent tree, but has owner of 5, btrfs can still continue reading the tree block, as long as it doesn't trigger other sanity checks. Normally this is fine, but combined with the empty tree check in check_leaf(), if we hit an empty extent tree, but the root node has csum tree owner, we can let such extent buffer to sneak in. Shrink the hole by: - Do extra eb owner check at tree read time - Make sure the root owner extent buffer exactly matches the root id. Unfortunately we can't yet completely patch the hole, there are several call sites can't pass all info we need: - For reloc/log trees Their owner is key::offset, not key::objectid. We need the full root key to do that accurate check. For now, we just skip the ownership check for those trees. - For add_data_references() of relocation That call site doesn't have any parent/ownership info, as all the bytenrs are all from btrfs_find_all_leafs(). - For direct backref items walk Direct backref items records the parent bytenr directly, thus unlike indirect backref item, we don't do a full tree search. Thus in that case, we don't have full parent owner to check. For the later two cases, they all pass 0 as @owner_root, thus we can skip those cases if @owner_root is 0. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	63c34cb4c6	btrfs: add and use helper to assert an inode range is clean We have four different scenarios where we don't expect to find ordered extents after locking a file range: 1) During plain fallocate; 2) During hole punching; 3) During zero range; 4) During reflinks (both cloning and deduplication). This is because in all these cases we follow the pattern: 1) Lock the inode's VFS lock in exclusive mode; 2) Lock the inode's i_mmap_lock in exclusive node, to serialize with mmap writes; 3) Flush delalloc in a file range and wait for all ordered extents to complete - both done through btrfs_wait_ordered_range(); 4) Lock the file range in the inode's io_tree. So add a helper that asserts that we don't have ordered extents for a given range. Make the four scenarios listed above use this helper after locking the respective file range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	55961c8abf	btrfs: remove ordered extent check and wait during hole punching and zero range For hole punching and zero range we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit `9902af79c0` ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit `e9adabb971` ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit `8d9b4a162a` ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing hole punching and zero range operations. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	bd6526d0df	btrfs: lock the inode first before flushing range when punching hole When doing hole punching we are flushing delalloc and waiting for ordered extents to complete before locking the inode (VFS lock and the btrfs specific i_mmap_lock). This is fine because even if a write happens after we call btrfs_wait_ordered_range() and before we lock the inode (call btrfs_inode_lock()), we will notice the write at btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered extent. We can however make this simpler by locking first the inode an then call btrfs_wait_ordered_range(), which will allow us to remove the ordered extent lookup logic from btrfs_punch_hole_lock_range() in the next patch. It also makes the behaviour the same as plain fallocate, hole punching and reflinks. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	ffa8fc603d	btrfs: remove ordered extent check and wait during fallocate For fallocate() we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit `9902af79c0` ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit `e9adabb971` ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit `8d9b4a162a` ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing a plain fallocate. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	1c6cbbbeee	btrfs: remove inode_dio_wait() calls when starting reflink operations When starting a reflink operation we have these calls to inode_dio_wait() which used to be needed because direct IO writes that don't cross the i_size boundary did not take the inode's VFS lock, so we could race with them and end up with ordered extents in target range after calling btrfs_wait_ordered_range(). However that is not the case anymore, because the inode's VFS lock was changed from a mutex to a rw semaphore, by commit `9902af79c0` ("parallel lookups: actual switch to rwsem"), and several years later we started to lock the inode's VFS lock in shared mode for direct IO writes that don't cross the i_size boundary (commit `e9adabb971` ("btrfs: use shared lock for direct writes within EOF")). So remove those inode_dio_wait() calls. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	831e1ee602	btrfs: remove useless dio wait call when doing fallocate zero range When starting a fallocate zero range operation, before getting the first extent map for the range, we make a call to inode_dio_wait(). This logic was needed in the past because direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (by commit `9902af79c0` ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (done in commit `e9adabb971` ("btrfs: use shared lock for direct writes within EOF")). The lockless direct IO writes could result in a race with the zero range operation, resulting in the later getting a stale extent map for the range. So remove this no longer needed call to inode_dio_wait(), as fallocate takes the inode's VFS lock in exclusive mode and direct IO writes within i_size take that same lock in shared mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Filipe Manana	47e1d1c7bb	btrfs: only reserve the needed data space amount during fallocate During a plain fallocate, we always start by reserving an amount of data space that matches the length of the range passed to fallocate. When we already have extents allocated in that range, we may end up trying to reserve a lot more data space then we need, which can result in several undesired behaviours: 1) We fail with -ENOSPC. For example the passed range has a length of 1G, but there's only one hole with a size of 1M in that range; 2) We temporarily reserve excessive data space that could be used by other operations happening concurrently; 3) By reserving much more data space then we need, we can end up doing expensive things like triggering dellaloc for other inodes, waiting for the ordered extents to complete, trigger transaction commits, allocate new block groups, etc. Example: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f -b 1g $DEV mount $DEV $MNT # Create a file with a size of 600M and two holes, one at [200M, 201M[ # and another at [401M, 402M[ xfs_io -f -c "pwrite -S 0xab 0 200M" \ -c "pwrite -S 0xcd 201M 200M" \ -c "pwrite -S 0xef 402M 198M" \ $MNT/foobar # Now call fallocate against the whole file range, see if it fails # with -ENOSPC or not - it shouldn't since we only need to allocate # 2M of data space. xfs_io -c "falloc 0 600M" $MNT/foobar umount $MNT $ ./test.sh (...) wrote 209715200/209715200 bytes at offset 0 200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec) wrote 209715200/209715200 bytes at offset 210763776 200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec) wrote 207618048/207618048 bytes at offset 421527552 198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec) fallocate: No space left on device $ So fix this by not allocating an amount of data space that matches the length of the range passed to fallocate. Instead allocate an amount of data space that corresponds to the sum of the sizes of each hole found in the range. This reservation now happens after we have locked the file range, which is safe since we know at this point there's no delalloc in the range because we've taken the inode's VFS lock in exclusive mode, we have taken the inode's i_mmap_lock in exclusive mode, we have flushed delalloc and waited for all ordered extents in the range to complete. This type of failure actually seems to happen in practice with systemd, and we had at least one report about this in a very long thread which is referenced by the Link tag below. Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Sweet Tea Dorminy	6c3636ebe3	btrfs: restore inode creation before xattr setting According to the tree checker, "all xattrs with a given objectid follow the inode with that objectid in the tree" is an invariant. This was broken by the recent change "btrfs: move common inode creation code into btrfs_create_new_inode()", which moved acl creation and property inheritance (stored in xattrs) to before inode insertion into the tree. As a result, under certain timings, the xattrs could be written to the tree before the inode, causing the tree checker to report violation of the invariant. Move property inheritance and acl creation back to their old ordering after the inode insertion. Suggested-by: Omar Sandoval <osandov@osandov.com> Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:09 +02:00
Omar Sandoval	caae78e032	btrfs: move common inode creation code into btrfs_create_new_inode() All of our inode creation code paths duplicate the calls to btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation additionally duplicates property inheritance and the call to btrfs_set_inode_index(). Fix this by moving the common code into btrfs_create_new_inode(). This accomplishes a few things at once: 1. It reduces code duplication. 2. It allows us to set up the inode completely before inserting the inode item, removing calls to btrfs_update_inode(). 3. It fixes a leak of an inode on disk in some error cases. For example, in btrfs_create(), if btrfs_new_inode() succeeds, then we have inserted an inode item and its inode ref. However, if something after that fails (e.g., btrfs_init_inode_security()), then we end the transaction and then decrement the link count on the inode. If the transaction is committed and the system crashes before the failed inode is deleted, then we leak that inode on disk. Instead, this refactoring aborts the transaction when we can't recover more gracefully. 4. It exposes various ways that subvolume creation diverges from mkdir in terms of inheriting flags, properties, permissions, and POSIX ACLs, a lot of which appears to be accidental. This patch explicitly does _not_ change the existing non-standard behavior, but it makes those differences more clear in the code and documents them so that we can discuss whether they should be changed. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Omar Sandoval	3538d68dbd	btrfs: reserve correct number of items for inode creation The various inode creation code paths do not account for the compression property, POSIX ACLs, or the parent inode item when starting a transaction. Fix it by refactoring all of these code paths to use a new function, btrfs_new_inode_prepare(), which computes the correct number of items. To do so, it needs to know whether POSIX ACLs will be created, so move the ACL creation into that function. To reduce the number of arguments that need to be passed around for inode creation, define struct btrfs_new_inode_args containing all of the relevant information. btrfs_new_inode_prepare() will also be a good place to set up the fscrypt context and encrypted filename in the future. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Omar Sandoval	5f465bf1f1	btrfs: factor out common part of btrfs_{mknod,create,mkdir}() btrfs_{mknod,create,mkdir}() are now identical other than the inode initialization and some inconsequential function call order differences. Factor out the common code to reduce code duplication. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Omar Sandoval	a1fd0c35ff	btrfs: allocate inode outside of btrfs_new_inode() Instead of calling new_inode() and inode_init_owner() inside of btrfs_new_inode(), do it in the callers. This allows us to pass in just the inode instead of the mnt_userns and mode and removes the need for memalloc_nofs_{save,restores}() since we do it before starting a transaction. In create_subvol(), it also means we no longer have to look up the inode again to instantiate it. This also paves the way for some more cleanups in later patches. This also removes the comments about Smack checking i_op, which are no longer true since commit `5d6c31910b` ("xattr: Add __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags & IOP_XATTR, which is set based on sb->s_xattr. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Qu Wenruo	b95b78e628	btrfs: warn when extent buffer leak test fails Although we have btrfs_extent_buffer_leak_debug_check() (enabled by CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have some extent buffer leakage, it's just pr_err(), not noisy enough for fstests to cache. So here we trigger a WARN_ON() if the allocated_ebs list is not empty. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Anand Jain	b67d73c1ff	btrfs: use a local variable for fs_devices pointer in btrfs_dev_replace_finishing In the function btrfs_dev_replace_finishing, we dereferenced fs_info->fs_devices 6 times. Use keep local variable for that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Gabriel Niebler	184b3d1900	btrfs: use btrfs_for_each_slot in btrfs_listxattr This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Gabriel Niebler	43cb1478de	btrfs: use btrfs_for_each_slot in btrfs_read_chunk_tree This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00
Gabriel Niebler	3d64f060a7	btrfs: use btrfs_for_each_slot in btrfs_unlink_all_paths This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-16 17:03:08 +02:00

... 4 5 6 7 8 ...

11344 Commits