OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Johannes Thumshirn	806570c0bb	btrfs: handle memory allocation failure in btrfs_csum_one_bio Since `f8a53bb58e` ("btrfs: handle checksum generation in the storage layer") the failures of btrfs_csum_one_bio() are handled via bio_end_io(). This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in case the allocation of the ordered sums fails. This also fixes a syzkaller report, where injecting a failure into the kvzalloc() call results in a BUG_ON(). Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-05-17 13:08:28 +02:00
Boris Burkov	e7db9e5c6b	btrfs: fix encoded write i_size corruption with no-holes We have observed a btrfs filesystem corruption on workloads using no-holes and encoded writes via send stream v2. The symptom is that a file appears to be truncated to the end of its last aligned extent, even though the final unaligned extent and even the file extent and otherwise correctly updated inode item have been written. So if we were writing out a 1MiB+X file via 8 128K extents and one extent of length X, i_size would be set to 1MiB, but the ninth extent, nbyte, etc. would all appear correct otherwise. The source of the race is a narrow (one line of code) window in which a no-holes fs has read in an updated i_size, but has not yet set a shared disk_i_size variable to write. Therefore, if two ordered extents run in parallel (par for the course for receive workloads), the following sequence can play out: (following "threads" a bit loosely, since there are callbacks involved for endio but extra threads aren't needed to cause the issue) ENC-WR1 (second to last) ENC-WR2 (last) ------- ------- btrfs_do_encoded_write set i_size = 1M submit bio B1 ending at 1M endio B1 btrfs_inode_safe_disk_i_size_write local i_size = 1M falls off a cliff for some reason btrfs_do_encoded_write set i_size = 1M+X submit bio B2 ending at 1M+X endio B2 btrfs_inode_safe_disk_i_size_write local i_size = 1M+X disk_i_size = 1M+X disk_i_size = 1M btrfs_delayed_update_inode btrfs_delayed_update_inode And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and writes respect i_size and present a corrupted file missing its last extents. Fix this by holding the inode lock in the no-holes case so that a thread can't sneak in a write to disk_i_size that gets overwritten with an out of date i_size. Fixes: `41a2ee75aa` ("btrfs: introduce per-inode file extent tree") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-05-02 14:21:00 +02:00
Qu Wenruo	b979547513	btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe The new helper will search the extent tree to find the first extent of a logical range, then fill the sectors array by two loops: - Loop 1 to fill common bits and metadata generation - Loop 2 to fill csum data (only for data bgs) This loop will use the new btrfs_lookup_csums_bitmap() to fill the full csum buffer, and set scrub_sector_verification::csum. With all the needed info filled by this function, later we only need to submit and verify the stripe. Here we temporarily export the helper to avoid warning on unused static function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:23 +02:00
Christoph Hellwig	e2eb02480c	btrfs: cleanup the main loop in btrfs_lookup_bio_sums Introduce a bio_offset variable for the current offset into the bio instead of recalculating it over and over. Remove the now only used once search_len and sector_offset variables, and reduce the scope for count and cur_disk_bytenr. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:16 +02:00
Christoph Hellwig	65886d2b1f	btrfs: remove search_file_offset_in_bio There is no need to search for a file offset in a bio, it is now always provided in bbio->file_offset (set at bio allocation time since `0d495430db` ("btrfs: set bbio->file_offset in alloc_new_bio")). Just use that with the offset into the bio. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:16 +02:00
Anand Jain	19337f8ea3	btrfs: switch search_file_offset_in_bio to return bool Function search_file_offset_in_bio() finds the file offset in the file_offset_ret, and we use the return value to indicate if it is successful, so use bool. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Anand Jain	da8269a3e9	btrfs: avoid reusing return variable in nested block in btrfs_lookup_bio_sums The function btrfs_lookup_bio_sums() and a nested if statement declare ret respectively as blk_status_t and int. There is no need to store the return value of search_file_offset_in_bio() to ret as this is a one-time call. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Christoph Hellwig	f8c44673e5	btrfs: simplify the btrfs_csum_one_bio calling convention To prepare for further bio submission changes btrfs_csum_one_bio should be able to take all it's arguments from the btrfs_bio structure. It can always use the bbio->inode already, and once the compression code is updated to set ->file_offset that one can be used unconditionally as well instead of looking at the page mapping now that btrfs doesn't allow ordered extents to span discontiguous data ranges. The only slightly tricky bit is the one_ordered flag set by the compressed writes. Replace that one with the driver private bio flag, which gets cleared before the bio is handed off to the block layer so that we don't get in the way of driver use. Note: this leaves an argument and a flag to btrfs_wq_submit_bio unused. But that whole mechanism will be removed in its current form in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:52 +01:00
Christoph Hellwig	4ae2edf12d	btrfs: simplify parameters of btrfs_lookup_bio_sums The csums argument is always NULL now, so remove it and always allocate the csums array in the btrfs_bio. Also pass the btrfs_bio instead of inode + bio to document that this function requires a btrfs_bio and not just any bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:50 +01:00
Christoph Hellwig	103c19723c	btrfs: split the bio submission path into a separate file The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00
Qu Wenruo	97e3823933	btrfs: introduce a bitmap based csum range search function Although we have an existing function, btrfs_lookup_csums_range(), to find all data checksums for a range, it's based on a btrfs_ordered_sum list. For the incoming RAID56 data checksum verification at RMW time, we don't want to waste time by allocating temporary memory. So this patch will introduce a new helper, btrfs_lookup_csums_bitmap(). It will use bitmap based result, which will be a perfect fit for later RAID56 usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00
Qu Wenruo	cb649e81da	btrfs: refactor checksum calculations in btrfs_lookup_csums_range() The refactoring involves the following parts: - Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers As we have quite some open-coded calculations, some of them are even split into two assignments just to fit 80 chars limit. - Remove the @csum_size parameter from max_ordered_sum_bytes() Csum size can be fetched from @fs_info. And we will use the csum_size_to_bytes() helper anyway. - Add a comment explaining how we handle the first search result - Use newly introduced helpers to cleanup btrfs_lookup_csums_range() - Move variables declaration to the minimal scope - Never mix number of sectors with bytes There are several locations doing things like: size = min_t(size_t, csum_end - start, max_ordered_sum_bytes(fs_info)); ... size >>= fs_info->sectorsize_bits Or offset = (start - key.offset) >> fs_info->sectorsize_bits; offset *= csum_size; Make sure these variables can only represent BYTES inside the function, by using the above bytes_to_csum_size() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00
Qu Wenruo	280f15cb96	btrfs: remove new_inline argument from btrfs_extent_item_to_extent_map() The argument @new_inline changes the following members of extent_map: - em->compress_type - EXTENT_FLAG_COMPRESSED of em->flags However neither members makes a difference for inline extents: - Inline extent read never use above em members As inside btrfs_get_extent() we directly use the file extent item to do the read. - Inline extents are never to be split Thus code really needs em->compress_type or that flag will never be executed on inlined extents. (btrfs_drop_extent_cache() would be one example) - Fiemap no longer relies on extent maps Recent fiemap optimization makes fiemap to search subvolume tree directly, without using any extent map at all. Thus those members make no difference for inline extents any more. Furthermore such exception without much explanation is really a source of confusion. Thus this patch will completely remove the argument, and always set the involved members, unifying the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:48 +01:00
Josef Bacik	7f0add250f	btrfs: move super_block specific helpers into super.h This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	7c8ede1628	btrfs: move file-item prototypes into their own header Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
David Sterba	43dd529abe	btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:45 +01:00
Josef Bacik	07e81dc944	btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	9b569ea0be	btrfs: move the printk helpers out of ctree.h We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	c7f13d428e	btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	26ce911446	btrfs: make can_nocow_extent nowait compatible If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-29 17:08:26 +02:00
Josef Bacik	2b6433c7f6	btrfs: move btrfs_ordered_sum_size into file-item.c This is defined in ordered-data.h, but is only used in file-item.c. Move this to file-item.c as it doesn't need to be global. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:06 +02:00
Josef Bacik	bd015294af	btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:05 +02:00
Josef Bacik	dbbf49928f	btrfs: remove the wake argument from clear_extent_bits This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:28:04 +02:00
Omar Sandoval	d1f68ba069	btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	1784b7d502	btrfs: handle csum lookup errors properly on reads Currently any error we get while trying to lookup csums during reads shows up as a missing csum, and then on the read completion side we print an error saying there was a csum mismatch and we increase the device corruption count. However we could have gotten an EIO from the lookup. We could also be inside of a memory constrained container and gotten a ENOMEM while trying to do the read. In either case we don't want to make this look like a file system corruption problem, we want to make it look like the actual error it is. Capture any negative value, convert it to the appropriate blk_status_t, free the csum array if we have one and bail. Note: a possible improvement would be to make the relocation code look up the owning inode and see if it's marked as NODATASUM and set EXTENT_NODATASUM there, that way if there's corruption and there isn't a checksum when we want it we can fail here rather than later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:51 +01:00
Josef Bacik	03ddb19d2e	btrfs: make search_csum_tree return 0 if we get -EFBIG We can either fail to find a csum entry at all and return -ENOENT, or we can find a range that is close, but return -EFBIG. In essence these both mean the same thing when we are doing a lookup for a csum in an existing range, we didn't find a csum. We want to treat both of these errors the same way, complain loudly that there wasn't a csum. This currently happens anyway because we do count = search_csum_tree(); if (count <= 0) { // reloc and error handling } However it forces us to incorrectly treat EIO or ENOMEM errors as on disk corruption. Fix this by returning 0 if we get either -ENOENT or -EFBIG from btrfs_lookup_csum() so we can do proper error handling. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:51 +01:00
Omar Sandoval	e331f6b19f	btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() btrfs_csum_one_bio() loops over each filesystem block in the bio while keeping a cursor of its current logical position in the file in order to look up the ordered extent to add the checksums to. However, this doesn't make much sense for compressed extents, as a sector on disk does not correspond to a sector of decompressed file data. It happens to work because: 1) the compressed bio always covers one ordered extent 2) the size of the bio is always less than the size of the ordered extent However, the second point will not always be true for encoded writes. Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that it can assume that the bio only covers one ordered extent. Since we're already changing the signature, let's get rid of the contig parameter and make it implied by the offset parameter, similar to the change we recently made to btrfs_lookup_bio_sums(). Additionally, let's rename nr_sectors to blockcount to make it clear that it's the number of filesystem blocks, not the number of 512-byte sectors. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:50 +01:00
Qu Wenruo	40e7efe057	btrfs: populate extent_map::generation when reading from disk When btrfs_get_extent() tries to get some file extent from disk, it never populates extent_map::generation, leaving the value to be 0. On the other hand, for extent map generated by IO, it will get its generation properly set at finish_ordered_io() finish_ordered_io() \|- unpin_extent_cache(gen = trans->transid) \|- em->generation = gen; [CAUSE] Since extent_map::generation is mostly used by fsync code, and for fsync they only care about modified extents, which all have their em::generation > 0. Thus it's fine to not populate em read from disk for fsync. [CORNER CASE] However autodefrag also relies on em::generation to determine if one extent needs to be defragged. This unpopulated extent_map::generation can prevent the following autodefrag case from working: mkfs.btrfs -f $dev mount $dev $mnt -o autodefrag # initial write to queue the inode for autodefrag xfs_io -f -c "pwrite 0 4k" $mnt/file sync # Real fragmented write xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file sync echo "=== before autodefrag ===" xfs_io -c "fiemap -v" $mnt/file # Drop cache to force em to be read from disk echo 3 > /proc/sys/vm/drop_caches mount -o remount,commit=1 $mnt sleep 3 sync echo "=== After autodefrag ===" xfs_io -c "fiemap -v" $mnt/file umount $mnt The result looks like this: === before autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..15]: 26672..26687 16 0x0 1: [16..31]: 26656..26671 16 0x0 2: [32..47]: 26640..26655 16 0x0 3: [48..63]: 26624..26639 16 0x1 === After autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..15]: 26672..26687 16 0x0 1: [16..31]: 26656..26671 16 0x0 2: [32..47]: 26640..26655 16 0x0 3: [48..63]: 26624..26639 16 0x1 This fragmented 32K will not be defragged by autodefrag. [FIX] To make things less weird, just populate extent_map::generation when reading file extents from disk. This would make above fragmented extents to be properly defragged: == before autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..15]: 26672..26687 16 0x0 1: [16..31]: 26656..26671 16 0x0 2: [32..47]: 26640..26655 16 0x0 3: [48..63]: 26624..26639 16 0x1 === After autodefrag === /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..63]: 26688..26751 64 0x1 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-14 13:13:50 +01:00
Josef Bacik	fc28b25e1f	btrfs: stop accessing ->csum_root directly We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-01-03 15:09:49 +01:00
Josef Bacik	056c831116	btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum root We have a few places where we skip doing csums if we mounted with one of the rescue options that ignores bad csum roots. In the future when there are multiple csum roots it'll be costly to check and see if there are any missing csum roots, so simply add a flag to indicate the fs should skip loading csums in case of errors. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-01-03 15:09:49 +01:00
Josef Bacik	84d2d6c701	btrfs: fix csum assert to check objectid of the root In the future we may have multiple csum roots, so simply check the objectid is for a csum root instead of checking against ->csum_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-01-03 15:09:49 +01:00
Josef Bacik	3212fa14e7	btrfs: drop the _nr from the item helpers Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_(), and then rework the _nr() helpers to be the btrfs_item_() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-01-03 15:09:43 +01:00
Christoph Hellwig	3dcfbcce1b	btrfs: use bvec_kmap_local in btrfs_csum_one_bio Using local kmaps slightly reduces the chances to stray writes, and the bvec interface cleans up the code a little bit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-10-26 19:08:06 +02:00
Qu Wenruo	c3a3b19bac	btrfs: rename struct btrfs_io_bio to btrfs_bio Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-10-26 19:08:02 +02:00
Qu Wenruo	bbc9a6eb5e	btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling There is a BUG_ON() in btrfs_csum_one_bio() to catch code logic error. It has indeed caught several bugs during subpage development. But the BUG_ON() itself will bring down the whole system which is an overkill. Replace it with a WARN() and exit gracefully, so that it won't crash the whole system while we can still catch the code logic error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-09-17 19:29:38 +02:00
Marcos Paulo de Souza	f8ee80de7b	btrfs: remove unneeded return variable in btrfs_lookup_file_extent We can return from btrfs_search_slot directly which also shows that it follows the same return value convention. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-08-23 13:19:01 +02:00
David Sterba	1a9fd4172d	btrfs: fix typos in comments Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>	2021-06-22 14:11:57 +02:00
Filipe Manana	ea7036de0d	btrfs: fix fsync failure and transaction abort after writes to prealloc extents When doing a series of partial writes to different ranges of preallocated extents with transaction commits and fsyncs in between, we can end up with a checksum items in a log tree. This causes an fsync to fail with -EIO and abort the transaction, turning the filesystem to RO mode, when syncing the log. For this to happen, we need to have a full fsync of a file following one or more fast fsyncs. The following example reproduces the problem and explains how it happens: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt # Create our test file with 2 preallocated extents. Leave a 1M hole # between them to ensure that we get two file extent items that will # never be merged into a single one. The extents are contiguous on disk, # which will later result in the checksums for their data to be merged # into a single checksum item in the csums btree. # $ xfs_io -f \ -c "falloc 0 1M" \ -c "falloc 3M 3M" \ /mnt/foobar # Now write to the second extent and leave only 1M of it as unwritten, # which corresponds to the file range [4M, 5M[. # # Then fsync the file to flush delalloc and to clear full sync flag from # the inode, so that a future fsync will use the fast code path. # # After the writeback triggered by the fsync we have 3 file extent items # that point to the second extent we previously allocated: # # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the # file range [3M, 4M[ # # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers # the file range [4M, 5M[ # # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the # file range [5M, 6M[ # # All these file extent items have a generation of 6, which is the ID of # the transaction where they were created. The split of the original file # extent item is done at btrfs_mark_extent_written() when ordered extents # complete for the file ranges [3M, 4M[ and [5M, 6M[. # $ xfs_io -c "pwrite -S 0xab 3M 1M" \ -c "pwrite -S 0xef 5M 1M" \ -c "fsync" \ /mnt/foobar # Commit the current transaction. This wipes out the log tree created by # the previous fsync. sync # Now write to the unwritten range of the second extent we allocated, # corresponding to the file range [4M, 5M[, and fsync the file, which # triggers the fast fsync code path. # # The fast fsync code path sees that there is a new extent map covering # the file range [4M, 5M[ and therefore it will log a checksum item # covering the range [1M, 2M[ of the second extent we allocated. # # Also, after the fsync finishes we no longer have the 3 file extent # items that pointed to 3 sections of the second extent we allocated. # Instead we end up with a single file extent item pointing to the whole # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the # current transaction ID). This is due to the file extent item merging we # do when completing ordered extents into ranges that point to unwritten # (preallocated) extents. This merging is done at # btrfs_mark_extent_written(). # $ xfs_io -c "pwrite -S 0xcd 4M 1M" \ -c "fsync" \ /mnt/foobar # Now do some write to our file outside the range of the second extent # that we allocated with fallocate() and truncate the file size from 6M # down to 5M. # # The truncate operation sets the full sync runtime flag on the inode, # forcing the next fsync to use the slow code path. It also changes the # length of the second file extent item so that it represents the file # range [3M, 5M[ and not the range [3M, 6M[ anymore. # # Finally fsync the file. Since this is a fsync that triggers the slow # code path, it will remove all items associated to the inode from the # log tree and then it will scan for file extent items in the # fs/subvolume tree that have a generation matching the current # transaction ID, which is 7. This means it will log 2 file extent # items: # # 1) One for the first extent we allocated, covering the file range # [0, 1M[ # # 2) Another for the first 2M of the second extent we allocated, # covering the file range [3M, 5M[ # # When logging the first file extent item we log a single checksum item # that has all the checksums for the entire extent. # # When logging the second file extent item, we also lookup for the # checksums that are associated with the range [0, 2M[ of the second # extent we allocated (file range [3M, 5M[), and then we log them with # btrfs_csum_file_blocks(). However that results in ending up with a log # that has two checksum items with ranges that overlap: # # 1) One for the range [1M, 2M[ of the second extent we allocated, # corresponding to the file range [4M, 5M[, which we logged in the # previous fsync that used the fast code path; # # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second # extents, respectively, corresponding to the files ranges [0, 1M[ # and [3M, 5M[. This one was added during this last fsync that uses # the slow code path and overlaps with the previous one logged by # the previous fast fsync. # # This happens because when logging the checksums for the second # extent, we notice they start at an offset that matches the end of the # checksums item that we logged for the first extent, and because both # extents are contiguous on disk, btrfs_csum_file_blocks() decides to # extend that existing checksums item and append the checksums for the # second extent to this item. The end result is we end up with two # checksum items in the log tree that have overlapping ranges, as # listed before, resulting in the fsync to fail with -EIO and aborting # the transaction, turning the filesystem into RO mode. # $ xfs_io -c "pwrite -S 0xff 0 1M" \ -c "truncate 5M" \ -c "fsync" \ /mnt/foobar fsync: Input/output error After running the example, dmesg/syslog shows the tree checker complained about the checksum items with overlapping ranges and we aborted the transaction: $ dmesg (...) [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610 [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929 [756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160 [756289.564649] inode generation 6 size 5242880 mode 100600 [756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16 [756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53 [756289.567725] extent data disk bytenr 13631488 nr 1048576 [756289.568697] extent data offset 0 nr 1048576 ram 1048576 [756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53 [756289.570682] extent data disk bytenr 0 nr 0 [756289.571363] extent data offset 0 nr 2097152 ram 2097152 [756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53 [756289.573246] extent data disk bytenr 14680064 nr 3145728 [756289.574121] extent data offset 0 nr 2097152 ram 3145728 [756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072 [756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024 [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected [756289.578644] ------------[ cut here ]------------ [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs] [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...) [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1 [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs] [756289.595122] Code: 5d c3 e8 76 60 (...) [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282 [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000 [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000 [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000 [756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000 [756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0 [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [756289.606400] Call Trace: [756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs] [756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs] [756289.608040] submit_one_bio+0x61/0x70 [btrfs] [756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs] [756289.609258] ? free_debug_processing+0x1d5/0x240 [756289.609812] ? __module_address+0x28/0xf0 [756289.610298] ? lock_acquire+0x1a0/0x3e0 [756289.610754] ? lock_acquired+0x19f/0x430 [756289.611220] ? lock_acquire+0x1a0/0x3e0 [756289.611675] do_writepages+0x43/0xf0 [756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100 [756289.612800] __filemap_fdatawrite_range+0xc5/0x100 [756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs] [756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs] [756289.614661] ? finish_wait+0x90/0x90 [756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0 [756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs] [756289.616338] ? lock_acquire+0x1a0/0x3e0 [756289.616801] ? lock_acquired+0x19f/0x430 [756289.617284] ? lock_acquire+0x1a0/0x3e0 [756289.617750] ? lock_release+0x214/0x470 [756289.618221] ? lock_acquired+0x19f/0x430 [756289.618704] ? dput+0x20/0x4a0 [756289.619079] ? dput+0x20/0x4a0 [756289.619452] ? lockref_put_or_lock+0x9/0x30 [756289.619969] ? lock_release+0x214/0x470 [756289.620445] ? lock_release+0x214/0x470 [756289.620924] ? lock_release+0x214/0x470 [756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs] [756289.621982] do_fsync+0x38/0x70 [756289.622395] __x64_sys_fsync+0x10/0x20 [756289.622907] do_syscall_64+0x33/0x80 [756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae [756289.624063] RIP: 0033:0x7f08b27fbb7b [756289.624588] Code: 0f 05 48 3d 00 (...) [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003 [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0 [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001 [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480 [756289.631819] irq event stamp: 0 [756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0 [756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0 [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0 [756289.635606] ---[ end trace 0a039fdc16ff3fef ]--- [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure [756289.637082] BTRFS info (device sdc): forced readonly Having checksum items covering ranges that overlap is dangerous as in some cases it can lead to having extent ranges for which we miss checksums after log replay or getting the wrong checksum item. There were some fixes in the past for bugs that resulted in this problem, and were explained and fixed by the following commits: `27b9a8122f` ("Btrfs: fix csum tree corruption, duplicate and outdated checksums") `b84b8390d6` ("Btrfs: fix file read corruption after extent cloning and fsync") `40e046acbd` ("Btrfs: fix missing data checksums after replaying a log tree") `e289f03ea7` ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents") Fix the issue by making btrfs_csum_file_blocks() taking into account the start offset of the next checksum item when it decides to extend an existing checksum item, so that it never extends the checksum to end at a range that goes beyond the start range of the next checksum item. When we can not access the next checksum item without releasing the path, simply drop the optimization of extending the previous checksum item and fallback to inserting a new checksum item - this happens rarely and the optimization is not significant enough for a log tree in order to justify the extra complexity, as it would only save a few bytes (the size of a struct btrfs_item) of leaf space. This behaviour is only needed when inserting into a log tree because for the regular checksums tree we never have a case where we try to insert a range of checksums that overlap with a range that was previously inserted. A test case for fstests will follow soon. Reported-by: Philipp Fent <fent@in.tum.de> Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ CC: stable@vger.kernel.org # 5.4+ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:31:36 +02:00
Josef Bacik	b86652be7c	btrfs: fix error handling in btrfs_del_csums Error injection stress would sometimes fail with checksums on disk that did not have a corresponding extent. This occurred because the pattern in btrfs_del_csums was while (1) { ret = btrfs_search_slot(); if (ret < 0) break; } ret = 0; out: btrfs_free_path(path); return ret; If we got an error from btrfs_search_slot we'd clear the error because we were breaking instead of goto out. Instead of using goto out, simply handle the cases where we may leave a random value in ret, and get rid of the ret = 0; out: pattern and simply allow break to have the proper error reporting. With this fix we properly abort the transaction and do not commit thinking we successfully deleted the csum. Reviewed-by: Qu Wenruo <wqu@suse.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-05-27 23:30:49 +02:00
Johannes Thumshirn	cea628008f	btrfs: remove duplicated in_range() macro The in_range() macro is defined twice in btrfs' source, once in ctree.h and once in misc.h. Remove the definition in ctree.h and include misc.h in the files depending on it. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Nikolay Borisov	ca4207ae13	btrfs: fix function description formats in file-item.c This fixes following W=1 warnings: fs/btrfs/file-item.c:27: warning: Cannot understand * @inode: the inode we want to update the disk_i_size for on line 27 - I thought it was a doc line fs/btrfs/file-item.c:65: warning: Cannot understand * @inode - the inode we're modifying on line 65 - I thought it was a doc line fs/btrfs/file-item.c:91: warning: Cannot understand * @inode - the inode we're modifying on line 91 - I thought it was a doc line Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-02-08 22:58:53 +01:00
ethanwu	9a66497156	btrfs: correctly calculate item size used when item key collision happens Item key collision is allowed for some item types, like dir item and inode refs, but the overall item size is limited by the nodesize. item size(ins_len) passed from btrfs_insert_empty_items to btrfs_search_slot already contains size of btrfs_item. When btrfs_search_slot reaches leaf, we'll see if we need to split leaf. The check incorrectly reports that split leaf is required, because it treats the space required by the newly inserted item as btrfs_item + item data. But in item key collision case, only item data is actually needed, the newly inserted item could merge into the existing one. No new btrfs_item will be inserted. And split_leaf return EOVERFLOW from following code: if (extend && data_size + btrfs_item_size_nr(l, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info)) return -EOVERFLOW; In most cases, when callers receive EOVERFLOW, they either return this error or handle in different ways. For example, in normal dir item creation the userspace will get errno EOVERFLOW; in inode ref case INODE_EXTREF is used instead. However, this is not the case for rename. To avoid the unrecoverable situation in rename, btrfs_check_dir_item_collision is called in early phase of rename. In this function, when item key collision is detected leaf space is checked: data_size = sizeof(*di) + name_len; if (data_size + btrfs_item_size_nr(leaf, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info)) the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here refers to existing item size, the condition here correctly calculates the needed size for collision case rather than the wrong case above. The consequence of inconsistent condition check between btrfs_check_dir_item_collision and btrfs_search_slot when item key collision happens is that we might pass check here but fail later at btrfs_search_slot. Rename fails and volume is forced readonly [436149.586170] ------------[ cut here ]------------ [436149.586173] BTRFS: Transaction aborted (error -75) [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1 [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286 [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006 [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0 [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472 [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08 [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe [436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000 [436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0 [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [436149.586296] Call Trace: [436149.586311] vfs_rename+0x383/0x920 [436149.586313] ? vfs_rename+0x383/0x920 [436149.586315] do_renameat2+0x4ca/0x590 [436149.586317] __x64_sys_rename+0x20/0x30 [436149.586324] do_syscall_64+0x5a/0x120 [436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [436149.586332] RIP: 0033:0x7fa3133b1d37 [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37 [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0 [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0 [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00 Thanks to Hans van Kranenburg for information about crc32 hash collision tools, I was able to reproduce the dir item collision with following python script. https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run it under a btrfs volume will trigger the abort transaction. It simply creates files and rename them to forged names that leads to hash collision. There are two ways to fix this. One is to simply revert the patch `878f2d2cb3` ("Btrfs: fix max dir item size calculation") to make the condition consistent although that patch is correct about the size. The other way is to handle the leaf space check correctly when collision happens. I prefer the second one since it correct leaf space check in collision case. This fix will not account sizeof(struct btrfs_item) when the item already exists. There are two places where ins_len doesn't contain sizeof(struct btrfs_item), however. 1. extent-tree.c: lookup_inline_extent_backref 2. file-item.c: btrfs_csum_file_blocks to make the logic of btrfs_search_slot more clear, we add a flag search_for_extension in btrfs_path. This flag indicates that ins_len passed to btrfs_search_slot doesn't contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot will use the actual size needed to calculate the required leaf space. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-18 14:50:00 +01:00
Qu Wenruo	6275193ef1	btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs Refactor btrfs_lookup_bio_sums() by: - Remove the @file_offset parameter There are two factors making the @file_offset parameter useless: * For csum lookup in csum tree, file offset makes no sense We only need disk_bytenr, which is unrelated to file_offset * page_offset (file offset) of each bvec is not contiguous. Pages can be added to the same bio as long as their on-disk bytenr is contiguous, meaning we could have pages at different file offsets in the same bio. Thus passing file_offset makes no sense any more. The only user of file_offset is for data reloc inode, we will use a new function, search_file_offset_in_bio(), to handle it. - Extract the csum tree lookup into search_csum_tree() The new function will handle the csum search in csum tree. The return value is the same as btrfs_find_ordered_sum(), returning the number of found sectors which have checksum. - Change how we do the main loop The only needed info from bio is: * the on-disk bytenr * the length After extracting the above info, we can do the search without bio at all, which makes the main loop much simpler: for (cur_disk_bytenr = orig_disk_bytenr; cur_disk_bytenr < orig_disk_bytenr + orig_len; cur_disk_bytenr += count * sectorsize) { /* Lookup csum tree / count = search_csum_tree(fs_info, path, cur_disk_bytenr, search_len, csum_dst); if (!count) { / Csum hole handling */ } } - Use single variable as the source to calculate all other offsets Instead of all different type of variables, we use only one main variable, cur_disk_bytenr, which represents the current disk bytenr. All involved values can be calculated from that variable, and all those variable will only be visible in the inner loop. The above refactoring makes btrfs_lookup_bio_sums() way more robust than it used to be, especially related to the file offset lookup. Now file_offset lookup is only related to data reloc inode, otherwise we don't need to bother file_offset at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:11 +01:00
Qu Wenruo	9e46458a7c	btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums The function btrfs_lookup_bio_sums() is only called for read bios. While btrfs_find_ordered_sum() is to search ordered extent sums, which is only for write path. This means to read a page we either: - Submit read bio if it's not uptodate This means we only need to search csum tree for checksums. - The page is already uptodate It can be marked uptodate for previous read, or being marked dirty. As we always mark page uptodate for dirty page. In that case, we don't need to submit read bio at all, thus no need to search any checksums. Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums(). And since btrfs_lookup_bio_sums() is the only caller for btrfs_find_ordered_sum(), also remove the implementation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:10 +01:00
David Sterba	1201b58b67	btrfs: drop casts of bio bi_sector Since commit `72deb455b5` ("block: remove CONFIG_LBDAF") (5.2) the sector_t type is u64 on all arches and configs so we don't need to typecast it. It used to be unsigned long and the result of sector size shifts were not guaranteed to fit in the type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-09 19:16:05 +01:00
Qu Wenruo	35478d053a	btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums In btrfs_lookup_bio_sums() if the bio is pretty large, we want to start readahead in the csum tree. However the threshold is an immediate number, (PAGE_SIZE * 8), from the initial btrfs merge. The meaning of the value is pretty hard to guess, especially when the immediate number is from the times when 4K sectorsize was the default and only CRC32C was supported. For the most common btrfs setup, CRC32 csum and 4K sectorsize, it means just 32K read would kick readahead, while the csum itself is only 32 bytes in size. Now let's be more reasonable by taking both csum size and node size into consideration. If the csum size for the bio is larger than one leaf, then we kick the readahead. This means for current default btrfs, the threshold will be 16M. This change should not change performance observably, thus this is mostly a readability enhancement. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:14 +01:00
Nikolay Borisov	76aea53796	btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:10 +01:00
Josef Bacik	b9729ce014	btrfs: locking: rip out path->leave_spinning We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:02 +01:00
David Sterba	713cebfb98	btrfs: remove unnecessary local variables for checksum size Remove local variable that is then used just once and replace it with fs_info::csum_size. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:54:00 +01:00
David Sterba	223486c27b	btrfs: switch cached fs_info::csum_size from u16 to u32 The fs_info value is 32bit, switch also the local u16 variables. This leads to a better assembly code generated due to movzwl. This simple change will shave some bytes on x86_64 and release config: text data bss dec hex filename 1090000 17980 14912 1122892 11224c pre/btrfs.ko 1089794 17980 14912 1122686 11217e post/btrfs.ko DELTA: -206 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-12-08 15:53:59 +01:00

1 2 3 4 5

237 Commits