-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmOBCJEACgkQxWXV+ddt
WDu5Nw/+P59ARfAm/4HRId4iL6UKozSMc+blWLeP9KkjcytdAfek0oGe3gZ7NJVK
8VYa93yNneCTkNFLIEpqEduGQjN04dr0odRUXD/kIR8EEtjbgDrH9ZmL47An5wVH
qE8ILlh2+DXk/QLTpjo8n4mm+MJDJYzfz/jVV9vl8ehMahjj1M0/KmO/vNvDbP2s
owWU1FBjX7TV6kHa+SQGqd1HfXS1YUx203I4SDmPj8vSXtysvSOWClT3HO6i6O5S
MSS3Me+rx9eMFMISNghL8I466+lPlGxK14DmLUE4l0kfoKyd4eHQw+ft76D6Twuz
JqjegAGA1nzqDO0XDXb4WPjrPKG8r8Ven2eInF3kncku9GyeEafL+L+nmj7PHsE7
dixWo2TQ9z1Wm/n1NWlU02ZSLdbetUtYTvZczUhevtNzuYUtILihcFZO3+Cp7V4p
R2WwJ5XXdfS8g8Q9kJCOuVd9fZ+3hQvEF1IwWCP9ZZfmIC6/4/uGGFB6TJu7HmZC
trpQYn9l5aP9L9Uq8t+9j+XoDEzQW0tZGpiYI9ypAa5Q5xbw3Ez2JNTbF7YVqQE2
iFDwuuy/X1iNvifniQgdodKVQLK/PcNrlcNb/gPG6cGCWjlTj3SKT9SlrwAgSDZW
pFWFb9NtN3ORjLeCiONo/ZGpZzM9/XQplub+4WuXQXGNJasRIoE=
=Q4JA
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix a regression in nowait + buffered write
- in zoned mode fix endianness when comparing super block generation
- locking and lockdep fixes:
- fix potential sleeping under spinlock when setting qgroup limit
- lockdep warning fixes when btrfs_path is freed after copy_to_user
- do not modify log tree while holding a leaf from fs tree locked
- fix freeing of sysfs files of static features on error
- use kv.alloc for zone map allocation as a fallback to avoid warnings
due to high order allocation
- send, avoid unaligned encoded writes when attempting to clone range
* tag 'for-6.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()
btrfs: do not modify log tree while holding a leaf from fs tree locked
btrfs: use kvcalloc in btrfs_get_dev_zone_info
btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()
btrfs: send: avoid unaligned encoded writes when attempting to clone range
btrfs: zoned: fix missing endianness conversion in sb_write_pointer
btrfs: free btrfs_path before copying subvol info to userspace
btrfs: free btrfs_path before copying fspath to userspace
btrfs: free btrfs_path before copying inodes to userspace
btrfs: free btrfs_path before copying root refs to userspace
btrfs: fix assertion failure and blocking during nowait buffered write
Although kset_unregister() can eventually remove all attribute files,
explicitly rolling back with the matching function makes the code logic
look clearer.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:
WARNING: possible circular locking dependency detected
6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.1/16154 is trying to acquire lock:
ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
but task is already holding lock:
ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-log-00){++++}-{3:3}:
down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
__btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #1 (btrfs-tree-00){++++}-{3:3}:
__lock_release kernel/locking/lockdep.c:5382 [inline]
lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
search_leaf fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
__btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
__btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3097 [inline]
check_prevs_add kernel/locking/lockdep.c:3216 [inline]
validate_chain kernel/locking/lockdep.c:3831 [inline]
__lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
lock_acquire kernel/locking/lockdep.c:5668 [inline]
lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
__mutex_lock_common kernel/locking/mutex.c:603 [inline]
__mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
__btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
__btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
evict+0x2ed/0x6b0 fs/inode.c:664
dispose_list+0x117/0x1e0 fs/inode.c:697
prune_icache_sb+0xeb/0x150 fs/inode.c:896
super_cache_scan+0x391/0x590 fs/super.c:106
do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
shrink_slab_memcg mm/vmscan.c:912 [inline]
shrink_slab+0x388/0x660 mm/vmscan.c:991
shrink_node_memcgs mm/vmscan.c:6088 [inline]
shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
shrink_zones mm/vmscan.c:6355 [inline]
do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
try_charge mm/memcontrol.c:2827 [inline]
charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
__mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
__filemap_add_folio+0x615/0xf80 mm/filemap.c:852
filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
__filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
find_or_create_page include/linux/pagemap.h:612 [inline]
alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
__btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
update_log_root fs/btrfs/tree-log.c:2841 [inline]
btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-log-00);
lock(btrfs-tree-00);
lock(btrfs-log-00);
lock(&delayed_node->mutex);
Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:
1) Modifying the log tree triggers an extent buffer allocation while
holding a write lock on a parent extent buffer from the log tree.
Allocating the pages for an extent buffer, or the extent buffer
struct, can trigger inode eviction and finally the inode eviction
will trigger a release/remove of a delayed node, which requires
taking the delayed node's mutex;
2) Allocating a metadata extent for a log tree can trigger the async
reclaim thread and make us wait for it to release enough space and
unblock our reservation ticket. The reclaim thread can start flushing
delayed items, and that in turn results in the need to lock delayed
node mutexes and in the need to write lock extent buffers of a
subvolume tree - all this while holding a write lock on the parent
extent buffer in the log tree.
So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.
Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.
Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzkaller reported BUG as follows:
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:274
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
__might_resched.cold+0x222/0x26b
kmem_cache_alloc+0x2e7/0x3c0
update_qgroup_limit_item+0xe1/0x390
btrfs_qgroup_inherit+0x147b/0x1ee0
create_subvol+0x4eb/0x1710
btrfs_mksubvol+0xfe5/0x13f0
__btrfs_ioctl_snap_create+0x2b0/0x430
btrfs_ioctl_snap_create_v2+0x25a/0x520
btrfs_ioctl+0x2a1c/0x5ce0
__x64_sys_ioctl+0x193/0x200
do_syscall_64+0x35/0x80
Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in
btrfs_run_qgroups() later outside of the spinlock context.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.
Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount -o compress $DEV $MNT
# File foo has a size of 33K, not aligned to the sector size.
xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
# Now clone the first 32K of file bar into foo at offset 0.
xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
# Snapshot the default subvolume and create a full send stream (v2).
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send --compressed-data -f /tmp/test.send $MNT/snap
echo -e "\nFile bar in the original filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
echo -e "\nReceiving stream in a new filesystem..."
btrfs receive -f /tmp/test.send $MNT
echo -e "\nFile bar in the new filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
write bar - offset=32768 length=1024
encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
(...)
This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
(...)
So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
generation is an on-disk __le64 value, so use btrfs_super_generation to
convert it to host endian before comparing it.
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl_get_subvol_info() frees the search path after the userspace
copy from the temp buffer @subvol_info. This can lead to a lock splat
warning.
Fix this by freeing the path before we copy it to userspace.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl_ino_to_path() frees the search path after the userspace copy
from the temp buffer @ipath->fspath. Which potentially can lead to a lock
splat warning.
Fix this by freeing the path before we copy it to userspace.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl_logical_to_ino() frees the search path after the userspace
copy from the temp buffer @inodes. Which potentially can lead to a lock
splat.
Fix this by freeing the path before we copy @inodes to userspace.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a nowait buffered write we can trigger the following assertion:
[11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658
[11138.438251] ------------[ cut here ]------------
[11138.438254] kernel BUG at fs/btrfs/messages.c:259!
[11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1
[11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
[11138.443583] Code: 5b 41 5a 41 (...)
[11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246
[11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000
[11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff
[11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001
[11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000
[11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000
[11138.452225] FS: 00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000
[11138.452949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0
[11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11138.455779] Call Trace:
[11138.456211] <TASK>
[11138.456598] btrfs_next_old_leaf.cold+0x18/0x1d [btrfs]
[11138.457827] ? kmem_cache_alloc+0x18d/0x2a0
[11138.458516] btrfs_lookup_csums_range+0x149/0x4d0 [btrfs]
[11138.459407] csum_exist_in_range+0x56/0x110 [btrfs]
[11138.460271] can_nocow_file_extent+0x27c/0x310 [btrfs]
[11138.461155] can_nocow_extent+0x1ec/0x2e0 [btrfs]
[11138.461672] btrfs_check_nocow_lock+0x114/0x1c0 [btrfs]
[11138.462951] btrfs_buffered_write+0x44c/0x8e0 [btrfs]
[11138.463482] btrfs_do_write_iter+0x42b/0x5f0 [btrfs]
[11138.463982] ? lock_release+0x153/0x4a0
[11138.464347] io_write+0x11b/0x570
[11138.464660] ? lock_release+0x153/0x4a0
[11138.465213] ? lock_is_held_type+0xe8/0x140
[11138.466003] io_issue_sqe+0x63/0x4a0
[11138.466339] io_submit_sqes+0x238/0x770
[11138.466741] __do_sys_io_uring_enter+0x37b/0xb10
[11138.467206] ? lock_is_held_type+0xe8/0x140
[11138.467879] ? syscall_enter_from_user_mode+0x1d/0x50
[11138.468688] do_syscall_64+0x38/0x90
[11138.469265] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[11138.470017] RIP: 0033:0x7f1438c539e6
This is because to check if we can NOCOW, we check that if we can NOCOW
into an extent (it's prealloc extent or the inode has NOCOW attribute),
and then check if there are csums for the extent's range in the csum tree.
The search may leave us beyond the last slot of a leaf, and then when
we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a
time_seq of 0.
This triggers a failure of the first assertion at btrfs_next_old_leaf(),
since we have a nowait path. With assertions disabled, we simply don't
respect the NOWAIT semantics, allowing the write to block on locks or
blocking on IO for reading an extent buffer from disk.
Fix this by:
1) Triggering the assertion only if time_seq is not 0, which means that
search is being done by a tree mod log user, and in the buffered and
direct IO write paths we don't use the tree mod log;
2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). Any failure to
lock an extent buffer should return immediately and not retry the
search, as well as if we need to do IO to read an extent buffer from
disk.
Fixes: c922b016f3 ("btrfs: assert nowait mode is not used for some btree search functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
introduced post-6.0 or which aren't considered serious enough to justify a
-stable backport.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY27xPAAKCRDdBJ7gKXxA
juFXAP4tSmfNDrT6khFhV0l4cS43bluErVNLh32RfXBqse8GYgEA5EPvZkOssLqY
86ejRXFgAArxYC4caiNURUQL+IASvQo=
=YVOx
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"22 hotfixes.
Eight are cc:stable and the remainder address issues which were
introduced post-6.0 or which aren't considered serious enough to
justify a -stable backport"
* tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
docs: kmsan: fix formatting of "Example report"
mm/damon/dbgfs: check if rm_contexts input is for a real context
maple_tree: don't set a new maximum on the node when not reusing nodes
maple_tree: fix depth tracking in maple_state
arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging
fs: fix leaked psi pressure state
nilfs2: fix use-after-free bug of ns_writer on remount
x86/traps: avoid KMSAN bugs originating from handle_bug()
kmsan: make sure PREEMPT_RT is off
Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN
x86/uaccess: instrument copy_from_user_nmi()
kmsan: core: kmsan_in_runtime() should return true in NMI context
mm: hugetlb_vmemmap: include missing linux/moduleparam.h
mm/shmem: use page_mapping() to detect page cache for uffd continue
mm/memremap.c: map FS_DAX device memory as decrypted
Partly revert "mm/thp: carry over dirty bit when thp splits on pmd"
nilfs2: fix deadlock in nilfs_count_free_blocks()
mm/mmap: fix memory leak in mmap_region()
hugetlbfs: don't delete error page from pagecache
maple_tree: reorganize testing to restore module testing
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmNtDwoACgkQxWXV+ddt
WDtcBQ/9HA9lLySbgveEj8taIbe6hXZ3Ry+1dSB/r0btb9e/tlcE7Md1ir3ewcIH
ICfjWkbltE5Xqo50Ll+cdbEt0kgMwP+2jISPUG4bikTprLRPp1q4Gl8H9frYotJL
76xC8rgmITC4ZR/PkYisauC3UJTv8EBnB19GzU+5SFh82ZfxF+XHmHFc5Wzdl8Q8
OObFOiVy28dTYubJc0cId39XceVbqv/uj+F/y5tQSZvhPhDRPZfPWBdW3LHIAMSP
xB4E9Qhbk9NAhFUHjvMwBBRao0q2D6ZO4IViB7y5qAIQOIfk6RJK11hAkeybqO+1
E8ADPY6XBEfM6SA3Bf7X4kz1gjTm/eF8l4lnLZdGT1husbBY4O3Biey0qUjZs+oP
LJTUtS3MJMEnTVoW/saUG3iTTDFFxJA+fbn6hKdNLqpKM6jjDgRx2MavbCNoUcCw
nnEVbCh+Z44xXE9+N7SH4E+ygoiwJwvkLLgYQ+ZaAHd7Wmpzmwnf9yWEiy1t1iv2
dj5bTv9jlZTacK8u/NUl6F/nqAIg5lcbNKAs1bPJ2m34ye5FKD2RPANgdqshNYFC
il7TgQjcnyVw17y0qYpqtLZrDsvTreQgUXeCprTPiTenJ1f72zyF7kHxjk12lHWd
/x22sNoX+uWlpJSW1niutVRdupVPqbwED+Qp0E5UkNaC3GeV/Bw=
=1+3V
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- revert memory optimization for scrub blocks, this misses errors in
2nd and following blocks
- add exception for ENOMEM as reason for transaction abort to not print
stack trace, syzbot has reported many
- zoned fixes:
- fix locking imbalance during scrub
- initialize zones for seeding device
- initialize zones for cloned device structures
- when looking up device, change assertion to a real check as some of
the search parameters can be passed by ioctl, reported by syzbot
- fix error pointer check in self tests
* tag 'for-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix locking imbalance on scrub
btrfs: zoned: initialize device's zone info for seeding
btrfs: zoned: clone zoned device info when cloning a device
Revert "btrfs: scrub: use larger block size for data extent scrub"
btrfs: don't print stack trace when transaction is aborted due to ENOMEM
btrfs: selftests: fix wrong error check in btrfs_free_dummy_root()
btrfs: fix match incorrectly in dev_args_match_device
When psi annotations were added to to btrfs compression reads, the psi
state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was
faulty. A pressure state, once entered, is never left. This results in
incorrectly elevated pressure, which triggers OOM kills.
pflags record the *previous* memstall state when we enter a new one. The
code tried to initialize pflags to 1, and then optimize the leave call
when we either didn't enter a memstall, or were already inside a nested
stall. However, there can be multiple PageWorkingset pages in the bio, at
which point it's that path itself that enters repeatedly and overwrites
pflags. This causes us to miss the exit.
Enter the stall only once if needed, then unwind correctly.
erofs has the same problem, fix that up too. And move the memstall exit
past submit_bio() to restore submit accounting originally added by
b8e24a9300 ("block: annotate refault stalls from IO submission").
Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org
Fixes: 4088a47e78 ("btrfs: add manual PSI accounting for compressed reads")
Fixes: 99486c511f ("erofs: add manual PSI accounting for the compressed address space")
Fixes: 118f3663fb ("block: remove PSI accounting from the bio layer")
Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If we're doing device replace on a zoned filesystem and discover in
scrub_enumerate_chunks() that we don't have to copy the block group it is
unlocked before it gets skipped.
But as the block group hasn't yet been locked before it leads to a locking
imbalance. To fix this simply remove the unlock.
This was uncovered by fstests' testcase btrfs/163.
Fixes: 9283b9e09a ("btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When performing seeding on a zoned filesystem it is necessary to
initialize each zoned device's btrfs_zoned_device_info structure,
otherwise mounting the filesystem will cause a NULL pointer dereference.
This was uncovered by fstests' testcase btrfs/163.
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When cloning a btrfs_device, we're not cloning the associated
btrfs_zoned_device_info structure of the device in case of a zoned
filesystem.
Later on this leads to a NULL pointer dereference when accessing the
device's zone_info for instance when setting a zone as active.
This was uncovered by fstests' testcase btrfs/161.
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 786672e9e1.
[BUG]
Since commit 786672e9e1 ("btrfs: scrub: use larger block size for data
extent scrub"), btrfs scrub no longer reports errors if the corruption
is not in the first sector of a STRIPE_LEN.
The following script can expose the problem:
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar
umount $mnt
# 13631488 is the logical bytenr of above 8K extent
btrfs-map-logical -l 13631488 -b 4096 $dev
mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1
# Corrupt the 2nd sector of that extent
xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev
mount $dev $mnt
btrfs scrub start -B $mnt
scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7
Scrub started: Mon Nov 7 07:18:27 2022
Status: finished
Duration: 0:00:00
Total to scrub: 536.00MiB
Rate: 0.00B/s
Error summary: no errors found <<<
[CAUSE]
That offending commit enlarges the data extent scrub size from sector
size to BTRFS_STRIPE_LEN, to avoid extra scrub_block to be allocated.
But unfortunately the data extent scrub is still heavily relying on the
fact that there is only one scrub_sector per scrub_block.
Thus it will only check the first sector, and ignoring the remaining
sectors.
Furthermore the error reporting is not able to handle multiple sectors
either.
[FIX]
For now just revert the offending commit.
The consequence is just extra memory usage during scrub.
We will need a proper change to make the remaining data scrub path to
handle multiple sectors before we enlarging the data scrub size.
Reported-by: Li Zhang <zhanglikernel@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add ENOMEM among the error codes that don't print stack trace on
transaction abort. We've got several reports from syzbot that detects
stacks as errors but caused by limiting memory. As this is an artificial
condition we don't need to know where exactly the error happens, the
abort and error cleanup will continue like e.g. for EIO.
As the transaction aborts code needs to be inline in a lot of code, the
implementation cases about minimal bloat. The error codes are in a
separate function and the WARN uses the condition directly. This
increases the code size by 571 bytes on release build.
Alternatives considered: add -ENOMEM among the errors, this increases
size by 2340 bytes, various attempts to combine the WARN and helper
calls, increase by 700 or more bytes.
Example syzbot reports (error -12):
- https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e
- https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_alloc_dummy_root() uses ERR_PTR as the error return value
rather than NULL, if error happened, there will be a NULL pointer
dereference:
BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs]
Read of size 8 at addr 000000000000002c by task insmod/258926
CPU: 2 PID: 258926 Comm: insmod Tainted: G W 6.1.0-rc2+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
kasan_report+0xb7/0x140
kasan_check_range+0x145/0x1a0
btrfs_free_dummy_root+0x21/0x50 [btrfs]
btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs]
btrfs_run_sanity_tests+0x65/0x80 [btrfs]
init_btrfs_fs+0xec/0x154 [btrfs]
do_one_initcall+0x87/0x2a0
do_init_module+0xdf/0x320
load_module+0x3006/0x3390
__do_sys_finit_module+0x113/0x1b0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fixes: aaedb55bc0 ("Btrfs: add tests for btrfs_get_extent")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
syzkaller found a failed assertion:
assertion failed: (args->devid != (u64)-1) || args->missing, in fs/btrfs/volumes.c:6921
This can be triggered when we set devid to (u64)-1 by ioctl. In this
case, the match of devid will be skipped and the match of device may
succeed incorrectly.
Patch 562d7b1512 introduced this function which is used to match device.
This function contains two matching scenarios, we can distinguish them by
checking the value of args->missing rather than check whether args->devid
and args->uuid is default value.
Reported-by: syzbot+031687116258450f9853@syzkaller.appspotmail.com
Fixes: 562d7b1512 ("btrfs: handle device lookup with btrfs_dev_lookup_args")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmNj2yMACgkQxWXV+ddt
WDsRPg/+Mgp4lLF6WCUhWNbO7K7EdJ+YEikDr7/35TTUcnpqZ6oBrWiHwwcG4d2S
V7eQLf/yId5zVfSD+aZEOSz8gC6Mh+0CujVdj09BYuDl7fDIEjFaoH38JsAhANFO
uUaqxzgZw2feWpwiEF9P2iwZD8VqUMAELjASjBBZVMs6WCpM6SDQRPDj/IkfI2BN
qgtKB7Im9VYBN92eIKlg6+MQCwuMMXKZRQH3dkPfYGJYQMDRyYrDxoeVWSAf9pGX
Xvb3mEUZEcPQmE6ue78Ny0OGXX2sh7Mvz4cEFBJvFUPi99Iu6TluVBgN0akuMTwZ
oZbV/1Abs+KV+yOICAhE/u7mKkLPsfRZeR4Ly8qjIlMUN12r1MR1BuGOJj750nsi
LLBohtfQ+BQYpEOrJ32MbdxXy6/CBinC6Xqz+J3M+F/AMYREPLaND7Co5YkgWyT4
pViRpgxLV+plP5bizbiXtnXI1h4OMBRx7idAZmeBNFtquHSzgf9psUz+sHI8Wvr2
tAI+6n7RSnUDG/N+p0cJSqZf4RZWevjVJrUS4pko56t9ixK/xPkyVFbYLIdcd3bC
N83tDgNtdBuyuFw3f2Ye+f0BxBhpZx6getQW2W9mb+6ylN5nyHFWmQpDGO5sDec0
KJRR3w8vQ/0+64P2JhjFbYW55CzpmB279qGxemsnGakDweEcs+o=
=Ltzp
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A batch of error handling fixes for resource leaks, fixes for nowait
mode in combination with direct and buffered IO:
- direct IO + dsync + nowait could miss a sync of the file after
write, add handling for this combination
- buffered IO + nowait should not fail with ENOSPC, only blocking IO
could determine that
- error handling fixes:
- fix inode reserve space leak due to nowait buffered write
- check the correct variable after allocation (direct IO submit)
- fix inode list leak during backref walking
- fix ulist freeing in self tests"
* tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix inode reserve space leak due to nowait buffered write
btrfs: fix nowait buffered write returning -ENOSPC
btrfs: remove pointless and double ulist frees in error paths of qgroup tests
btrfs: fix ulist leaks in error paths of qgroup self tests
btrfs: fix inode list leak during backref walking at find_parent_nodes()
btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
btrfs: fix lost file sync on direct IO write with nowait and dsync iocb
btrfs: fix a memory allocation failure test in btrfs_submit_direct
During a nowait buffered write, if we fail to balance dirty pages we exit
btrfs_buffered_write() without releasing the delalloc space reserved for
an extent, resulting in leaking space from the inode's block reserve.
So fix that by releasing the delalloc space for the extent when balancing
dirty pages fails.
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com
Fixes: 965f47aeb5 ("btrfs: make btrfs_buffered_write nowait compatible")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are doing a buffered write in NOWAIT context and we can't reserve
metadata space due to -ENOSPC, then we should return -EAGAIN so that we
retry the write in a context allowed to block and do metadata reservation
with flushing, which might succeed this time due to the allowed flushing.
Returning -ENOSPC while in NOWAIT context simply makes some writes fail
with -ENOSPC when they would likely succeed after switching from NOWAIT
context to blocking context. That is unexpected behaviour and even fio
complains about it with a warning like this:
fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536
fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device
The fio's job config is this:
[global]
bs=64K
ioengine=io_uring
iodepth=1
size=2236962133
nr_files=1
filesize=2236962133
direct=0
runtime=10
fallocate=posix
io_size=2236962133
group_reporting
time_based
[task_0]
rw=randwrite
directory=/mnt/sdi
numjobs=4
So fix this by returning -EAGAIN if we are in NOWAIT context and the
metadata reservation failed with -ENOSPC.
Fixes: 304e45acdb ("btrfs: plumb NOWAIT through the write path")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several places in the qgroup self tests follow the pattern of freeing the
ulist pointer they passed to btrfs_find_all_roots() if the call to that
function returned an error. That is pointless because that function always
frees the ulist in case it returns an error.
Also In some places like at test_multiple_refs(), after a call to
btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots"
pointing to ulists that were freed, because btrfs_qgroup_account_extent()
has freed those ulists, and if after that the next call to
btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots"
ulist again, resulting in a double free.
So remove those calls to reduce the code size and avoid double ulist
free in case of an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests,
if we fail to add the tree ref, remove the extent item or remove the
extent ref, we are returning from the test function without freeing the
"old_roots" ulist that was allocated by the previous calls to
btrfs_find_all_roots(). Fix that by calling ulist_free() before returning.
Fixes: 442244c963 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During backref walking, at find_parent_nodes(), if we are dealing with a
data extent and we get an error while resolving the indirect backrefs, at
resolve_indirect_refs(), or in the while loop that iterates over the refs
in the direct refs rbtree, we end up leaking the inode lists attached to
the direct refs we have in the direct refs rbtree that were not yet added
to the refs ulist passed as argument to find_parent_nodes(). Since they
were not yet added to the refs ulist and prelim_release() does not free
the lists, on error the caller can only free the lists attached to the
refs that were added to the refs ulist, all the remaining refs get their
inode lists never freed, therefore leaking their memory.
Fix this by having prelim_release() always free any attached inode list
to each ref found in the rbtree, and have find_parent_nodes() set the
ref's inode list to NULL once it transfers ownership of the inode list
to a ref added to the refs ulist passed to find_parent_nodes().
Fixes: 86d5f99442 ("btrfs: convert prelimary reference tracking to use rbtrees")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During backref walking, at resolve_indirect_refs(), if we get an error
we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
which frees all the elements in the ulist - however that does not free
any inode lists that may be attached to elements, through the 'aux' field
of a ulist node, so we end up leaking lists if we have any attached to
the unodes.
Fix this by calling free_leaf_list() instead of ulist_free() when we exit
from resolve_indirect_refs(). The static function free_leaf_list() is
moved up for this to be possible and it's slightly simplified by removing
unnecessary code.
Fixes: 3301958b7c ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmNfzNwACgkQxWXV+ddt
WDuC6Q//a72PAq1sjwvQqAcr+OOe3PWnmlwYZCnXxiab5c74Kc7rDhDZcO3m/Qt5
3YTwgK5FT4Y0AI8RN1NXx3+UOAYCWp/TGeBdbPHg35XIYKAnCh4pfql84Uiw1Awz
HbqmSTma7sqVdRMehkKCkd7w4YoyAAsDdyXFQlSFm4ah9WHFZDswBc+m6xQZuWvU
QVQS6wUTxkxuBZp0UComWGBNHiDeDZbga7VqO8UHPYOB394IV2mYP6fh8l0oB/BS
bfKgsHjV9e0S0Ul0oPVADCGCiJcTbdnw3IA+Cje7MSgZ3kds/4Bo5IJWT5QRb94A
yDAFpxc+t3+FgpoKS3/tZK7imXwgpXueiT2bBj+BjDDWD2VUVVBG4QmXYIW6tuqY
vtEFw9+NCAvS2gRetHyXxQshYh/QW//+AZSkuI6/fuPSM+lRG5E0lnDxqrZiOMIo
e6SJOGH3tCmtusL5VSXIQ8DPaLI9PBg4OXChytwmLHwPIusbQOvD5sTDpd99UezB
dLXqZOGGScAc11HU1AFyZfAxTBybUgUxX/xCviJtf7ZOWKdcwiFrzSJOL5upSPz3
8qZTVjrD71mJlEa0Z8wj0Utuu4Psecp0GN+fs5JJxmqsFO0cYApU17OqPZ22+yEV
RU26YNpqurYVarHVER4WxyXYraBYd1Cr6s6bFVDnuZynfiCOYIw=
=3tvc
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes and regression fixes:
- fix a corner case when handling tree-mod-log chagnes in reallocated
notes
- fix crash on raid0 filesystems created with <5.4 mkfs.btrfs that
could lead to division by zero
- add missing super block checksum verification after thawing
filesystem
- handle one more case in send when dealing with orphan files
- fix parameter type mismatch for generation when reading dentry
- improved error handling in raid56 code
- better struct bio packing after recent cleanups"
* tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't use btrfs_chunk::sub_stripes from disk
btrfs: fix type of parameter generation in btrfs_get_dentry
btrfs: send: fix send failure of a subcase of orphan inodes
btrfs: make thaw time super block check to also verify checksum
btrfs: fix tree mod log mishandling of reallocated nodes
btrfs: reorder btrfs_bio for better packing
btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed
btrfs: raid56: properly handle the error when unable to find the missing stripe
When doing a direct IO write using a iocb with nowait and dsync set, we
end up not syncing the file once the write completes.
This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().
For a nowait write however we can end up not doing the sync ourselves at
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in such case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).
So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.
A test case will later be sent for fstests, after this is fixed in Linus'
tree.
Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After allocation 'dip' is tested instead of 'dip->csums'. Fix it.
Fixes: 642c5d34da ("btrfs: allocate the btrfs_dio_private as part of the iomap dio bio")
CC: stable@vger.kernel.org # 5.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There are two reports (the earliest one from LKP, a more recent one from
kernel bugzilla) that we can have some chunks with 0 as sub_stripes.
This will cause divide-by-zero errors at btrfs_rmap_block, which is
introduced by a recent kernel patch ac0677348f ("btrfs: merge
calculations for simple striped profiles in btrfs_rmap_block"):
if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
BTRFS_BLOCK_GROUP_RAID10)) {
stripe_nr = stripe_nr * map->num_stripes + i;
stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<<
}
[CAUSE]
From the more recent report, it has been proven that we have some chunks
with 0 as sub_stripes, mostly caused by older mkfs.
It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa
("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk")
which is included in v5.4 btrfs-progs release.
So there would be quite some old filesystems with such 0 sub_stripes.
[FIX]
Just don't trust the sub_stripes values from disk.
We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes
numbers for each profile and that are fixed.
By this, we can keep the compatibility with older filesystems while
still avoid divide-by-zero bugs.
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Viktor Kuzmin <kvaster@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559
Fixes: ac0677348f ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block")
CC: stable@vger.kernel.org # 6.0
Reviewed-by: Su Yue <glass@fydeos.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type of parameter generation has been u32 since the beginning,
however all callers pass a u64 generation, so unify the types to prevent
potential loss.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9ed0a72e5b ("btrfs: send: fix failures when processing inodes with
no links") tries to fix all incremental send cases of orphan inodes the
send operation will meet. However, there's still a bug causing the corner
subcase fails with a ENOENT error.
Here's shortened steps of that subcase:
$ btrfs subvolume create vol
$ touch vol/foo
$ btrfs subvolume snapshot -r vol snap1
$ btrfs subvolume snapshot -r vol snap2
# Turn the second snapshot to RW mode and delete the file while
# holding an open file descriptor on it
$ btrfs property set snap2 ro false
$ exec 73<snap2/foo
$ rm snap2/foo
# Set the second snapshot back to RO mode and do an incremental send
# with an unusal reverse order
$ btrfs property set snap2 ro true
$ btrfs send -p snap2 snap1 > /dev/null
At subvol snap1
ERROR: send ioctl failed with -2: No such file or directory
It's subcase 3 of BTRFS_COMPARE_TREE_CHANGED in the commit 9ed0a72e5b
("btrfs: send: fix failures when processing inodes with no links"). And
it's not a common case. We still have not met it in the real world.
Theoretically, this case can happen in a batch cascading snapshot backup.
In cascading backups, the receive operation in the middle may cause orphan
inodes to appear because of the open file descriptors on the snapshot files
during receiving. And if we don't do the batch snapshot backups in their
creation order, then we can have an inode, which is an orphan in the parent
snapshot but refers to a file in the send snapshot. Since an orphan inode
has no paths, the send operation will fail with a ENOENT error if it
tries to generate a path for it.
In that patch, this subcase will be treated as an inode with a new
generation. However, when the routine tries to delete the old paths in
the parent snapshot, the function process_all_refs() doesn't check whether
there are paths recorded or not before it calls the function
process_recorded_refs(). And the function process_recorded_refs() try
to get the first path in the parent snapshot in the beginning. Since it has
no paths in the parent snapshot, the send operation fails.
To fix this, we can easily put a link count check to avoid entering the
deletion routine like what we do a link count check to avoid creating a
new one. Moreover, we can assume that the function process_all_refs()
can always collect references to process because we know it has a
positive link count.
Fixes: 9ed0a72e5b ("btrfs: send: fix failures when processing inodes with no links")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previous commit a05d3c9153 ("btrfs: check superblock to ensure the fs
was not modified at thaw time") only checks the content of the super
block, but it doesn't really check if the on-disk super block has a
matching checksum.
This patch will add the checksum verification to thaw time superblock
verification.
This involves the following extra changes:
- Export btrfs_check_super_csum()
As we need to call it in super.c.
- Change the argument list of btrfs_check_super_csum()
Instead of passing a char *, directly pass struct btrfs_super_block *
pointer.
- Verify that our checksum type didn't change before checking the
checksum value, like it's done at mount time
Fixes: a05d3c9153 ("btrfs: check superblock to ensure the fs was not modified at thaw time")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have been seeing the following panic in production
kernel BUG at fs/btrfs/tree-mod-log.c:677!
invalid opcode: 0000 [#1] SMP
RIP: 0010:tree_mod_log_rewind+0x1b4/0x200
RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293
RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000
RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00
RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001
R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00
FS: 00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
btrfs_get_old_root+0x12b/0x420
btrfs_search_old_slot+0x64/0x2f0
? tree_mod_log_oldest_root+0x3d/0xf0
resolve_indirect_ref+0xfd/0x660
? ulist_alloc+0x31/0x60
? kmem_cache_alloc_trace+0x114/0x2c0
find_parent_nodes+0x97a/0x17e0
? ulist_alloc+0x30/0x60
btrfs_find_all_roots_safe+0x97/0x150
iterate_extent_inodes+0x154/0x370
? btrfs_search_path_in_tree+0x240/0x240
iterate_inodes_from_logical+0x98/0xd0
? btrfs_search_path_in_tree+0x240/0x240
btrfs_ioctl_logical_to_ino+0xd9/0x180
btrfs_ioctl+0xe2/0x2ec0
? __mod_memcg_lruvec_state+0x3d/0x280
? do_sys_openat2+0x6d/0x140
? kretprobe_dispatcher+0x47/0x70
? kretprobe_rethook_handler+0x38/0x50
? rethook_trampoline_handler+0x82/0x140
? arch_rethook_trampoline_callback+0x3b/0x50
? kmem_cache_free+0xfb/0x270
? do_sys_openat2+0xd5/0x140
__x64_sys_ioctl+0x71/0xb0
do_syscall_64+0x2d/0x40
Which is this code in tree_mod_log_rewind()
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
This occurs because we replay the nodes in order that they happened, and
when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot,
starting at 0. 'n' here is the number of items in this block, which in
this case was 1, but we had 2 REMOVE_WHILE_FREEING operations.
The actual root cause of this was that we were replaying operations for
a block that shouldn't have been replayed. Consider the following
sequence of events
1. We have an already modified root, and we do a btrfs_get_tree_mod_seq().
2. We begin removing items from this root, triggering KEY_REPLACE for
it's child slots.
3. We remove one of the 2 children this root node points to, thus triggering
the root node promotion of the remaining child, and freeing this node.
4. We modify a new root, and re-allocate the above node to the root node of
this other root.
The tree mod log looks something like this
logical 0 op KEY_REPLACE (slot 1) seq 2
logical 0 op KEY_REMOVE (slot 1) seq 3
logical 0 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 4
logical 4096 op LOG_ROOT_REPLACE (old logical 0) seq 5
logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 1) seq 6
logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 7
logical 0 op LOG_ROOT_REPLACE (old logical 8192) seq 8
>From here the bug is triggered by the following steps
1. Call btrfs_get_old_root() on the new_root.
2. We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is
currently logical 0.
3. tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which
gives us the KEY_REPLACE seq 2, and since that's not a
LOG_ROOT_REPLACE we incorrectly believe that we don't have an old
root, because we expect that the most recent change should be a
LOG_ROOT_REPLACE.
4. Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE,
so we don't set old_root, we simply use our existing extent buffer.
5. Since we're using our existing extent buffer (logical 0) we call
tree_mod_log_search(0) in order to get the newest change to start the
rewind from, which ends up being the LOG_ROOT_REPLACE at seq 8.
6. Again since we didn't find an old_root we simply clone logical 0 at
it's current state.
7. We call tree_mod_log_rewind() with the cloned extent buffer.
8. Set n = btrfs_header_nritems(logical 0), which would be whatever the
original nritems was when we COWed the original root, say for this
example it's 2.
9. We start from the newest operation and work our way forward, so we
see LOG_ROOT_REPLACE which we ignore.
10. Next we see KEY_REMOVE_WHILE_FREEING for slot 0, which triggers the
BUG_ON(tm->slot < n), because it expects if we've done this we have a
completely empty extent buffer to replay completely.
The correct thing would be to find the first LOG_ROOT_REPLACE, and then
get the old_root set to logical 8192. In fact making that change fixes
this particular problem.
However consider the much more complicated case. We have a child node
in this tree and the above situation. In the above case we freed one
of the child blocks at the seq 3 operation. If this block was also
re-allocated and got new tree mod log operations we would have a
different problem. btrfs_search_old_slot(orig root) would get down to
the logical 0 root that still pointed at that node. However in
btrfs_search_old_slot() we call tree_mod_log_rewind(buf) directly. This
is not context aware enough to know which operations we should be
replaying. If the block was re-allocated multiple times we may only
want to replay a range of operations, and determining what that range is
isn't possible to determine.
We could maybe solve this by keeping track of which root the node
belonged to at every tree mod log operation, and then passing this
around to make sure we're only replaying operations that relate to the
root we're trying to rewind.
However there's a simpler way to solve this problem, simply disallow
reallocations if we have currently running tree mod log users. We
already do this for leaf's, so we're simply expanding this to nodes as
well. This is a relatively uncommon occurrence, and the problem is
complicated enough I'm worried that we will still have corner cases in
the reallocation case. So fix this in the most straightforward way
possible.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 3.3+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After changes in commit 917f32a235 ("btrfs: give struct btrfs_bio a
real end_io handler") the layout of btrfs_bio can be improved. There
are two holes and the structure size is 264 bytes on release build. By
reordering the iterator we can get rid of the holes and the size is 256
bytes which fits to slabs much better.
Final layout:
struct btrfs_bio {
unsigned int mirror_num; /* 0 4 */
struct bvec_iter iter; /* 4 20 */
u64 file_offset; /* 24 8 */
struct btrfs_device * device; /* 32 8 */
u8 * csum; /* 40 8 */
u8 csum_inline[64]; /* 48 64 */
/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
btrfs_bio_end_io_t end_io; /* 112 8 */
void * private; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct work_struct end_io_work; /* 128 32 */
struct bio bio; /* 160 96 */
/* size: 256, cachelines: 4, members: 10 */
};
Fixes: 917f32a235 ("btrfs: give struct btrfs_bio a real end_io handler")
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if full_stripe_write() failed to allocate the pages for
parity, it will call __free_raid_bio() first, then return -ENOMEM.
But some caller of full_stripe_write() will also call __free_raid_bio()
again, this would cause double freeing.
And it's not a logically sound either, normally we should either free
the memory at the same level where we allocated it, or let endio to
handle everything.
So this patch will solve the double freeing by make
raid56_parity_write() to handle the error and free the rbio.
Just like what we do in raid56_parity_recover().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In raid56_alloc_missing_rbio(), if we can not determine where the
missing device is inside the full stripe, we just BUG_ON().
This is not necessary especially the only caller inside scrub.c is
already properly checking the return value, and will treat it as a
memory allocation failure.
Fix the error handling by:
- Add an extra warning for the reason
Although personally speaking it may be better to be an ASSERT().
- Properly free the allocated rbio
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmNNTxsACgkQxWXV+ddt
WDs7QA//WaEPFWO/086pWBhlJF8k+QCNwM/EEKQL5x4hxC6pEGbHk04q63IY3XKh
B9WEoxTfkxpHz8p+p9wDVRJl1Fdby/UFc1l2xTLcU273wL4Iweqf00N4WyEYmqQ3
DtZcYmh4r7gZ9cTuHk8Ex+llhqAUiN7mY3FoEU8naZCE+Fdn/h3T8DT79V2XLgzv
4f46ci0ao74o30EE7vc/Yw3gr1ouJJ4Ajw/UCEXUVC9tWOLcDNE6501AshT/ozDp
m2tljY630QIayaMjtR+HCJHmdmB5bNGdE01Cssqc8M+M7AtQKvf+A/nQNTiI0UfK
6ODdukvteTEEVKL2XHHkW6RWzR1rfhT1JOrl3YRKZwAKYsURXI/t+2UIjZtVstY9
GRb3YGBDVggtbjyXxC04i4WyF3RoHRehGiF/G303BBGFMXfgZ17rvSp7DfL9KLcc
VNycW17CcQLVZXueNWJrNSu2dQ0X8Lx0X+OTcsxRkNCJW+JQHffDl/TwMt0GtRoO
Vhwjp8vUKuJDZbjvGXg0ZKrmk0T12+L8ubt5o5fQtMiFf+RGq77xzI1112ZIwsL0
OtGOD3ShgKDvz24HoxSAVTbHq+/s+bmhIL/xU4QAeol3sOVPfx6b+KqcmTyG9E9u
+gbqB9js/2vbDFNtmhOV8Fv1HbGT8bwtMCIlq5CzsiX+aT5rT88=
=aPaQ
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fiemap fixes:
- add missing path cache update
- fix processing of delayed data and tree refs during backref
walking, this could lead to reporting incorrect extent sharing
- fix extent range locking under heavy contention to avoid deadlocks
- make it possible to test send v3 in debugging mode
- update links in MAINTAINERS
* tag 'for-6.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: update btrfs website links and files
btrfs: ignore fiemap path cache if we have multiple leaves for a data extent
btrfs: fix processing of delayed tree block refs during backref walking
btrfs: fix processing of delayed data refs during backref walking
btrfs: delete stale comments after merge conflict resolution
btrfs: unlock locked extent area if we have contention
btrfs: send: update command for protocol version check
btrfs: send: allow protocol version 3 with CONFIG_BTRFS_DEBUG
btrfs: add missing path cache update during fiemap
The path cache used during fiemap used to determine the sharedness of
extent buffers in a path from a leaf containing a file extent item
pointing to our data extent up to the root node of the tree, is meant to
be used for a single path. Having a single path is by far the most common
case, and therefore worth to optimize for, but it's possible to actually
have multiple paths because we have 2 or more leaves.
If we have multiple leaves, the 'level' variable keeps getting incremented
in each iteration of the while loop at btrfs_is_data_extent_shared(),
which means we will treat the second leaf in the 'tmp' ulist as a level 1
node, and so forth. In the worst case this can lead to getting a level
greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
WARN_ON_ONCE() in the functions to lookup from or store in the path cache
(lookup_backref_shared_cache() and store_backref_shared_cache()). If the
current level never goes beyond 8, due to shared nodes in the paths and
a fs tree height smaller than 8, it can still result in incorrectly
marking one leaf as shared because some other leaf is shared and is stored
one level below that other leaf, as when storing a true sharedness value
in the cache results in updating the sharedness to true of all entries in
the cache below the current level.
Having multiple leaves happens in a case like the following:
- We have a file extent item point to data extent at bytenr X, for
a file range [0, 1M[ for example;
- At this moment we have an extent data ref for the extent, with
an offset of 0 and a count of 1;
- A write into the middle of the extent happens, file range [64K, 128K)
so the file extent item is split into two (at btrfs_drop_extents()):
1) One for file range [0, 64K), with a length (num_bytes field) of
64K and an extent offset of 0;
2) Another one for file range [128K, 1M), with a length of 896K
(1M - 128K) and an extent offset of 128K.
- At this moment the two file extent items are located in the same
leaf;
- A new file extent item for the range [64K, 128K), pointing to a new
data extent, is inserted in the leaf. This results in a leaf split
and now those two file extent items pointing to data extent X end
up located in different leaves;
- Once delayed refs are run, we still have a single extent data ref
item for our data extent at bytenr X, for offset 0, but now with a
count of 2 instead of 1;
- So during fiemap, at btrfs_is_data_extent_shared(), after we call
find_parent_nodes() for the data extent, we get two leaves, since
we have two file extent items point to data extent at bytenr X that
are located in two different leaves.
So skip the use of the path cache when we get more than one leaf.
Fixes: 12a824dc67 ("btrfs: speedup checking for extent sharedness during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During backref walking, when processing a delayed reference with a type of
BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
1) We are accessing the delayed references extent_op, and its key, without
the protection of the delayed ref head's lock;
2) If there's no extent op for the delayed ref head, we end up with an
uninitialized key in the stack, variable 'tmp_op_key', and then pass
it to add_indirect_ref(), which adds the reference to the indirect
refs rb tree.
This is wrong, because indirect references should have a NULL key
when we don't have access to the key, and in that case they should be
added to the indirect_missing_keys rb tree and not to the indirect rb
tree.
This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
from freeing an extent buffer, therefore with a count of -1, it will
not cancel out the corresponding reference we have in the extent tree
(with a count of 1), since both references end up in different rb
trees.
When using fiemap, where we often need to check if extents are shared
through shared subtrees resulting from snapshots, it means we can
incorrectly report an extent as shared when it's no longer shared.
However this is temporary because after the transaction is committed
the extent is no longer reported as shared, as running the delayed
reference results in deleting the tree block reference from the extent
tree.
Outside the fiemap context, the result is unpredictable, as the key was
not initialized but it's used when navigating the rb trees to insert
and search for references (prelim_ref_compare()), and we expect all
references in the indirect rb tree to have valid keys.
The following reproducer triggers the second bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
# With a compressed 128M file we get a tree height of 2 (level 1 root).
xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
btrfs subvolume snapshot $MNT $MNT/snap
# Fiemap should output 0x2008 in the flags column.
# 0x2000 means shared extent
# 0x8 means encoded extent (because it's compressed)
echo
echo "fiemap after snapshot, range [120M, 120M + 128K):"
xfs_io -c "fiemap -v 120M 128K" $MNT/foo
echo
# Overwrite one extent and fsync to flush delalloc and COW a new path
# in the snapshot's tree.
#
# After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
# BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
# buffer in the path.
#
# In the extent tree we have inline references of type
# BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
# buffers, so they should cancel each other, and the extent buffers in
# the fs tree should no longer be considered as shared.
#
echo "Overwriting file range [120M, 120M + 128K)..."
xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
xfs_io -c "fsync" $MNT/snap/foo
# Fiemap should output 0x8 in the flags column. The extent in the range
# [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
# tree.
echo
echo "fiemap after overwrite range [120M, 120M + 128K):"
xfs_io -c "fiemap -v 120M 128K" $MNT/foo
echo
umount $MNT
Running it before this patch:
$ ./test.sh
(...)
wrote 134217728/134217728 bytes at offset 0
128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
fiemap after snapshot, range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x2008
Overwriting file range [120M, 120M + 128K)...
wrote 131072/131072 bytes at offset 125829120
128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
fiemap after overwrite range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x2008
The extent in the range [120M, 120M + 128K) is still reported as shared
(0x2000 bit set) after overwriting that range and flushing delalloc, which
is not correct - an entire path was COWed in the snapshot's tree and the
extent is now only referenced by the original fs tree.
Running it after this patch:
$ ./test.sh
(...)
wrote 134217728/134217728 bytes at offset 0
128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
fiemap after snapshot, range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x2008
Overwriting file range [120M, 120M + 128K)...
wrote 131072/131072 bytes at offset 125829120
128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
fiemap after overwrite range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x8
Now the extent is not reported as shared anymore.
So fix this by passing a NULL key pointer to add_indirect_ref() when
processing a delayed reference for a tree block if there's no extent op
for our delayed ref head with a defined key. Also access the extent op
only after locking the delayed ref head's lock.
The reproducer will be converted later to a test case for fstests.
Fixes: 86d5f99442 ("btrfs: convert prelimary reference tracking to use rbtrees")
Fixes: a6dbceafb9 ("btrfs: Remove unused op_key var from add_delayed_refs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When processing delayed data references during backref walking and we are
using a share context (we are being called through fiemap), whenever we
find a delayed data reference for an inode different from the one we are
interested in, then we immediately exit and consider the data extent as
shared. This is wrong, because:
1) This might be a DROP reference that will cancel out a reference in the
extent tree;
2) Even if it's an ADD reference, it may be followed by a DROP reference
that cancels it out.
In either case we should not exit immediately.
Fix this by never exiting when we find a delayed data reference for
another inode - instead add the reference and if it does not cancel out
other delayed reference, we will exit early when we call
extent_is_shared() after processing all delayed references. If we find
a drop reference, then signal the code that processes references from
the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
immediately if it finds there a reference for another inode, since we
have delayed drop references that may cancel it out. In this later case
we exit once we don't have references in the rb trees that cancel out
each other and have two references for different inodes.
Example reproducer for case 1):
$ cat test-1.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite 0 64K" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
echo
echo "fiemap after cloning:"
xfs_io -c "fiemap -v" $MNT/foo
rm -f $MNT/bar
echo
echo "fiemap after removing file bar:"
xfs_io -c "fiemap -v" $MNT/foo
umount $MNT
Running it before this patch, the extent is still listed as shared, it has
the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
$ ./test-1.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
Example reproducer for case 2):
$ cat test-2.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite 0 64K" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
# Flush delayed references to the extent tree and commit current
# transaction.
sync
echo
echo "fiemap after cloning:"
xfs_io -c "fiemap -v" $MNT/foo
rm -f $MNT/bar
echo
echo "fiemap after removing file bar:"
xfs_io -c "fiemap -v" $MNT/foo
umount $MNT
Running it before this patch, the extent is still listed as shared, it has
the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
$ ./test-2.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
After this patch, after deleting bar in both tests, the extent is not
reported with the 0x2000 flag anymore, it gets only the flag 0x1
(which is FIEMAP_EXTENT_LAST):
$ ./test-1.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x1
$ ./test-2.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x1
These tests will later be converted to a test case for fstests.
Fixes: dc046b10c8 ("Btrfs: make fiemap not blow when you have lots of snapshots")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two comments in btrfs_cache_block_group that I left when
resolving conflict between commits ced8ecf026 "btrfs: fix space cache
corruption and potential double allocations" and 527c490f44 "btrfs:
delete btrfs_wait_space_cache_v1_finished".
The former reworked the caching logic to wait until the caching ends in
btrfs_cache_block_group while the latter only open coded the waiting.
Both removed btrfs_wait_space_cache_v1_finished, the correct code is
with the waiting and returning error. Thus the conflict resolution was
OK.
Signed-off-by: David Sterba <dsterba@suse.com>
In production we hit the following deadlock
task 1 task 2 task 3
------ ------ ------
fiemap(file) falloc(file) fsync(file)
write(0, 1MiB)
btrfs_commit_transaction()
wait_on(!pending_ordered)
lock(512MiB, 1GiB)
start_transaction
wait_on_transaction
lock(0, 1GiB)
wait_extent_bit(512MiB)
task 4
------
finish_ordered_extent(0, 1MiB)
lock(0, 1MiB)
**DEADLOCK**
This occurs because when task 1 does it's lock, it locks everything from
0-512MiB, and then waits for the 512MiB chunk to unlock. task 2 will
never unlock because it's waiting on the transaction commit to happen,
the transaction commit is waiting for the outstanding ordered extents,
and then the ordered extent thread is blocked waiting on the 0-1MiB
range to unlock.
To fix this we have to clear anything we've locked so far, wait for the
extent_state that we contended on, and then try to re-lock the entire
range again.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For a protocol and command compatibility we have a helper that hasn't
been updated for v3 yet. We use it for verity so update where necessary.
Fixes: 38622010a6 ("btrfs: send: add support for fs-verity")
Signed-off-by: David Sterba <dsterba@suse.com>
We haven't finalized send stream v3 yet, so gate the send stream version
behind CONFIG_BTRFS_DEBUG as we want some way to test it.
The original verity send did not check the protocol version, so add that
actual protection as well.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY0DP2AAKCRBZ7Krx/gZQ
6/+qAQCEGQWpcC5MB17zylaX7gqzhgAsDrwtpevlno3aIv/1pQD/YWr/E8tf7WTW
ERXRXMRx1cAzBJhUhVgIY+3ANfU2Rg4=
=cko4
-----END PGP SIGNATURE-----
Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs tmpfile updates from Al Viro:
"Miklos' ->tmpfile() signature change; pass an unopened struct file to
it, let it open the damn thing. Allows to add tmpfile support to FUSE"
* tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse: implement ->tmpfile()
vfs: open inside ->tmpfile()
vfs: move open right after ->tmpfile()
vfs: make vfs_tmpfile() static
ovl: use vfs_tmpfile_open() helper
cachefiles: use vfs_tmpfile_open() helper
cachefiles: only pass inode to *mark_inode_inuse() helpers
cachefiles: tmpfile error handling cleanup
hugetlbfs: cleanup mknod and tmpfile
vfs: add vfs_tmpfile_open() helper
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...