Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().
At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.
==================================================================
BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1d6/0x29e lib/dump_stack.c:118
print_address_description+0x66/0x620 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report+0x132/0x1d0 mm/kasan/report.c:530
btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x44840a
RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
Allocated by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
kmalloc_node include/linux/slab.h:577 [inline]
kvmalloc_node+0x81/0x110 mm/util.c:574
kvmalloc include/linux/mm.h:757 [inline]
kvzalloc include/linux/mm.h:765 [inline]
btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x113/0x200 mm/slab.c:3756
deactivate_locked_super+0xa7/0xf0 fs/super.c:335
btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff8880878e0000
which belongs to the cache kmalloc-16k of size 16384
The buggy address is located 1704 bytes inside of
16384-byte region [ffff8880878e0000, ffff8880878e4000)
The buggy address belongs to the page:
page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfffe0000010200(slab|head)
raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.
Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.
There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image. Fail the mount as this needs a closer
look what is actually wrong.
As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:
WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
panic+0x347/0x7c0 kernel/panic.c:231
__warn.cold+0x20/0x46 kernel/panic.c:600
report_bug+0x1bd/0x210 lib/bug.c:198
handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
btrfs_fill_super fs/btrfs/super.c:1316 [inline]
btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).
CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By doing so we can associate the sequence counter to the chunk_mutex
for lockdep purposes (compiled-out otherwise), the mutex is otherwise
used on the write side.
Also avoid explicitly disabling preemption around the write region as it
will now be done automatically by the seqcount machinery based on the
lock type.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:
[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.
After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:
>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156
Which corresponds to:
int btrfs_reada_wait(void *handle)
{
struct reada_control *rc = handle;
struct btrfs_fs_info *fs_info = rc->fs_info;
while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
reada_start_machine(fs_info);
wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
(HZ + 9) / 10);
}
(...)
So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:
$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
def dump_re(re):
nzones = re.nzones.value_()
print(f're at {hex(re.value_())}')
print(f'\t logical {re.logical.value_()}')
print(f'\t refcnt {re.refcnt.value_()}')
print(f'\t nzones {nzones}')
for i in range(nzones):
dev = re.zones[i].device
name = dev.name.str.string_()
print(f'\t\t dev id {dev.devid.value_()} name {name}')
print()
for _, e in radix_tree_for_each(fs_info.reada_tree):
re = cast('struct reada_extent *', e)
dump_re(re)
$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$
So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.
Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:
$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *
mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass
if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)
fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
def bg_flags_string(bg):
flags = bg.flags.value_()
ret = ''
if flags & BTRFS_BLOCK_GROUP_DATA:
ret = 'data'
if flags & BTRFS_BLOCK_GROUP_METADATA:
if len(ret) > 0:
ret += '|'
ret += 'meta'
if flags & BTRFS_BLOCK_GROUP_SYSTEM:
if len(ret) > 0:
ret += '|'
ret += 'system'
if flags & BTRFS_BLOCK_GROUP_RAID0:
ret += ' raid0'
elif flags & BTRFS_BLOCK_GROUP_RAID1:
ret += ' raid1'
elif flags & BTRFS_BLOCK_GROUP_DUP:
ret += ' dup'
elif flags & BTRFS_BLOCK_GROUP_RAID10:
ret += ' raid10'
elif flags & BTRFS_BLOCK_GROUP_RAID5:
ret += ' raid5'
elif flags & BTRFS_BLOCK_GROUP_RAID6:
ret += ' raid6'
elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
ret += ' raid1c3'
elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
ret += ' raid1c4'
else:
ret += ' single'
return ret
def dump_bg(bg):
print()
print(f'block group at {hex(bg.value_())}')
print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
dump_bg(bg)
$ drgn dump_block_groups.py
block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6
block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6
block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6
block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6
block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6
block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6
block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$
So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.
So after figuring that out it became obvious why the hang happens:
1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;
2) Task B starts a device replace with /dev/sdd as the source device;
3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;
4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.
It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.
As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;
4) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;
5) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().
Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().
At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.
This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.
This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.
For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:
1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;
2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;
3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;
4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();
5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().
And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.
While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.
So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.
This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Many things can happen after the device is scanned and before the device
is mounted. One such thing is losing the BTRFS_MAGIC on the device.
If it happens we still won't free that device from the memory and cause
the userland confusion.
For example: As the BTRFS_IOC_DEV_INFO still carries the device path
which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
device which does not belong to the filesystem anymore:
$ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
$ wipefs -a /dev/sdb
# /dev/sdb does not contain magic signature
$ mount -o degraded /dev/sda /btrfs
$ btrfs fi show -m
Label: none uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
Total devices 2 FS bytes used 128.00KiB
devid 1 size 3.00GiB used 571.19MiB path /dev/sda
devid 2 size 3.00GiB used 571.19MiB path /dev/sdb
We need to distinguish the missing signature and invalid superblock, so
add a specific error code ENODATA for that. This also fixes failure of
fstest btrfs/198.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I noticed when fixing device stats for seed devices that we simply threw
away the return value from btrfs_search_slot(). This is because we may
not have stat items, but we could very well get an error, and thus miss
reporting the error up the chain.
Fix this by returning ret if it's an actual error, and then stop trying
to init the rest of the devices stats and return the error up the chain.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recently started recording device stats across the fleet, and noticed
a large increase in messages such as this
BTRFS warning (device dm-0): get dev_stats failed, not yet valid
on our tiers that use seed devices for their root devices. This is
because we do not initialize the device stats for any seed devices if we
have a sprout device and mount using that sprout device. The basic
steps for reproducing are:
$ mkfs seed device
$ mount seed device
# fill seed device
$ umount seed device
$ btrfstune -S 1 seed device
$ mount seed device
$ btrfs device add -f sprout device /mnt/wherever
$ umount /mnt/wherever
$ mount sprout device /mnt/wherever
$ btrfs device stats /mnt/wherever
This will fail with the above message in dmesg.
Fix this by iterating over the fs_devices->seed if they exist in
btrfs_init_dev_stats. This fixed the problem and properly reports the
stats for both devices.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_device_init_dev_stats ]
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not have a common exit block and returns immediatelly
so there's no point having the goto. Remove the two cases.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can check the argument value directly, no need for the temporary
variable.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a mounted sprout filesystem, all threads now are using the
sprout::device_list_mutex, and this is the only code using the
seed::device_list_mutex. This patch converts to use the sprouts
fs_info->fs_devices->device_list_mutex.
The same reasoning holds true here, that device delete is holding
the sprout::device_list_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
argument to indicate whether to free all devices or just one device.
Export btrfs_sysfs_remove_device() as device operations outside of
sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir().
btrfs_sysfs_remove_devices_dir() is renamed to
btrfs_sysfs_remove_fs_devices() to suite its new role.
Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
so it is redeclared s static. And the same function had to be moved
before its first caller.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we add a device we need to add it to sysfs, so instead of using the
btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
add a device or all of fs_devices, call the helper function directly
btrfs_sysfs_add_device() and thus make it non-static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.
The current message is too long, so drop the unnecessary UUID and add
process name and PID.
While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent
CC: stable@vger.kernel.org # 4.19+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a flag bit for exclusive operation, use a variable to
store which exclusive operation is being performed. Introduce an API
to start and finish an exclusive operation.
This would enable another way for tools to check which operation is
running on why starting an exclusive operation failed. The followup
patch adds a sysfs_notify() to alert userspace when the state changes, so
userspace can perform select() on it to get notified of the change.
This would enable us to enqueue a command which will wait for current
exclusive operation to complete before issuing the next exclusive
operation. This has been done synchronously as opposed to a background
process, or else error collection (if any) will become difficult.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of opencoding filemap_write_and_wait simply call syncblockdev as
it makes it abundantly clear what's going on and why this is used. No
semantics changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following the refactor of btrfs_free_stale_devices in
7bcb8164ad ("btrfs: use device_list_mutex when removing stale devices")
fs_devices are freed after they have been iterated by the inner
list_for_each so the use-after-free fixed by introducing the break in
fd649f10c3 ("btrfs: Fix use-after-free when cleaning up fs_devs with
a single stale device") is no longer necessary. Just remove it
altogether. No functional changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Invert unlocked to locked and exploit the fact it can only ever be
modified if we are adding a new device to a seed filesystem. This allows
to simplify the check in error: label. No semantics changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When adding a new device there's a mandatory check to see if a device is
being duplicated to the filesystem it's added to. Since this is a
read-only operations not necessary to take device_list_mutex and can simply
make do with an rcu-readlock.
Using just RCU is safe because there won't be another device add delete
running in parallel as btrfs_init_new_device is called only from
btrfs_ioctl_add_dev.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex. As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_prepare_sprout is called when the first rw device is added to a
seed filesystem. This means the filesystem can't have its alloc_list
be non-empty, since seed filesystems are read only. Simply remove the
code altogether.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Without good understanding of how seed devices works it's hard to grok
some of what the code in open_seed_devices or btrfs_prepare_sprout does.
Add comments hopefully reducing some of the cognitive load.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While this patch touches a bunch of files the conversion is
straighforward. Instead of using the implicit linked list anchored at
btrfs_fs_devices::seed the code is switched to using
list_for_each_entry.
Previous patches in the series already factored out code that processed
both main and seed devices so in those cases the factored out functions
are called on the main fs_devices and then on every seed dev inside
list_for_each_entry.
Using list api also allows to simplify deletion from the seed dev list
performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
substituting a while() loop with a simple list_del_init.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It makes no sense to have sysfs-related routines be responsible for
properly initialising the fs_info pointer of struct btrfs_fs_device.
Instead this can be streamlined by making it the responsibility of
btrfs_init_devices_late to initialize it. That function already
initializes fs_info of every individual device in btrfs_fs_devices.
As far as clearing it is concerned it makes sense to move it to
close_fs_devices. That function is only called when struct
btrfs_fs_devices is no longer in use - either for holding seeds or
main devices for a mounted filesystem.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return value of this function conveys absolutely no information.
All callers already check the state of fs_devices->opened to decide how
to proceed. So convert the function to returning void. While at it make
btrfs_close_devices also return void.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This prepares the code to switching seeds devices to a proper list.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io
tree") introduced btrfs_device::alloc_state extent io tree, but it
doesn't initialize the fs_info and owner member.
This means the following features are not properly supported:
- Fs owner report for insert_state() error
Without fs_info initialized, although btrfs_err() won't panic, it
won't output which fs is causing the error.
- Wrong owner for trace events
alloc_state will get the owner as pinned extents.
Fix this by assiging proper fs_info and owner for
btrfs_device::alloc_state.
Fixes: 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io tree")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be accessed from 'fs_devices' as it's identical to
fs_info->fs_devices. Also add a comment about why we are calling the
function. No semantic changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to move the closing of the src_device out of all the device
replace locking, but we definitely want to zero out the superblock
before we commit the last time to make sure the device is properly
removed. Handle this by pushing btrfs_scratch_superblocks into
btrfs_dev_replace_finishing, and then later on we'll move the src_device
closing and freeing stuff where we need it to be.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported a lockdep splat in generic/476 that I could reproduce
with btrfs/187.
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc2+ #1 Tainted: G W
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc_trace+0x3a/0x1a0
btrfs_alloc_device+0x43/0x210
add_missing_dev+0x20/0x90
read_one_chunk+0x301/0x430
btrfs_read_sys_array+0x17b/0x1b0
open_ctree+0xa62/0x1896
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x379
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
vfs_rmdir.part.0+0x149/0x160
do_rmdir+0x136/0x1a0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&fs_info->chunk_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x12d/0x150
__lock_acquire+0x1184/0x1fa0
lock_acquire+0xa4/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa4/0x3d0
? btrfs_evict_inode+0x11e/0x500
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x46/0x60
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This is because we are holding the chunk_mutex when we call
btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want
to switch that to a GFP_NOFS lock because this is the only place where
it matters. So instead use memalloc_nofs_save() around the allocation
in order to avoid the lockdep splat.
Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
------------------------------------------------------
btrfs-uuid/7955 is trying to acquire lock:
ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
but task is already holding lock:
ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (btrfs-uuid-00){++++}-{3:3}:
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_uuid_tree_add+0x89/0x2d0
btrfs_uuid_scan_kthread+0x330/0x390
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
down_read_nested+0x3e/0x140
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-uuid-00);
lock(btrfs-root-00);
lock(btrfs-uuid-00);
lock(btrfs-root-00);
*** DEADLOCK ***
1 lock held by btrfs-uuid/7955:
#0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __btrfs_tree_read_lock+0x39/0x180
? btrfs_root_node+0x1c/0x1d0
down_read_nested+0x3e/0x140
? __btrfs_tree_read_lock+0x39/0x180
__btrfs_tree_read_lock+0x39/0x180
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x4bd/0x990
btrfs_find_root+0x45/0x1b0
btrfs_read_tree_root+0x61/0x100
btrfs_get_root_ref.part.50+0x143/0x630
btrfs_uuid_tree_iterate+0x207/0x314
? btree_readpage+0x20/0x20
btrfs_uuid_rescan_kthread+0x12/0x50
kthread+0x133/0x150
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots. The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.
It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop. So simply move
the path release before we add our entry to the uuid tree.
This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has it's own problems.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script can lead to tons of beyond device boundary access:
mkfs.btrfs -f $dev -b 10G
mount $dev $mnt
trimfs $mnt
btrfs filesystem resize 1:-1G $mnt
trimfs $mnt
[CAUSE]
Since commit 929be17a9b ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that's
already trimmed.
So we check device->alloc_state by finding the first range which doesn't
have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
But if we shrunk the device, that bits are not cleared, thus we could
easily got a range starts beyond the shrunk device size.
This results the returned @start and @end are all beyond device size,
then we call "end = min(end, device->total_bytes -1);" making @end
smaller than device size.
Then finally we goes "len = end - start + 1", totally underflow the
result, and lead to the beyond-device-boundary access.
[FIX]
This patch will fix the problem in two ways:
- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
This is the root fix
- Add extra safety check when trimming free device extents
We check and warn if the returned range is already beyond current
device.
Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are currently getting this lockdep splat in btrfs/161:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc5+ #20 Tainted: G E
------------------------------------------------------
mount/678048 is trying to acquire lock:
ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
but task is already holding lock:
ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x8b/0x8f0
btrfs_init_new_device+0x2d2/0x1240 [btrfs]
btrfs_ioctl+0x1de/0x2d20 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__lock_acquire+0x1240/0x2460
lock_acquire+0xab/0x360
__mutex_lock+0x8b/0x8f0
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
*** DEADLOCK ***
3 locks held by mount/678048:
#0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
#1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
#2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
stack backtrace:
CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
Call Trace:
dump_stack+0x96/0xd0
check_noncircular+0x162/0x180
__lock_acquire+0x1240/0x2460
? asm_sysvec_apic_timer_interrupt+0x12/0x20
lock_acquire+0xab/0x360
? clone_fs_devices+0x4d/0x170 [btrfs]
__mutex_lock+0x8b/0x8f0
? clone_fs_devices+0x4d/0x170 [btrfs]
? rcu_read_lock_sched_held+0x52/0x60
? cpumask_next+0x16/0x20
? module_assert_mutex_or_preempt+0x14/0x40
? __module_address+0x28/0xf0
? clone_fs_devices+0x4d/0x170 [btrfs]
? static_obj+0x4f/0x60
? lockdep_init_map_waits+0x43/0x200
? clone_fs_devices+0x4d/0x170 [btrfs]
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
? super_setup_bdi_name+0x79/0xd0
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? vfs_parse_fs_string+0x84/0xb0
? rcu_read_lock_sched_held+0x52/0x60
? kfree+0x2b5/0x310
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? rcu_read_lock_sched_held+0x52/0x60
? legacy_get_tree+0x30/0x50
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
? memdup_user+0x4e/0x90
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
then read the device, which takes the device_list_mutex. The
device_list_mutex needs to be taken before the chunk_mutex, so this is a
problem. We only really need the chunk mutex around adding the chunk,
so move the mutex around read_one_chunk.
An argument could be made that we don't even need the chunk_mutex here
as it's during mount, and we are protected by various other locks.
However we already have special rules for ->device_list_mutex, and I'd
rather not have another special case for ->chunk_mutex.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since most metadata reservation calls can return -EINTR when get
interrupted by fatal signal, we need to review the all the metadata
reservation call sites.
In relocation code, the metadata reservation happens in the following
sites:
- btrfs_block_rsv_refill() in merge_reloc_root()
merge_reloc_root() is a pretty critical section, we don't want to be
interrupted by signal, so change the flush status to
BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
Since such change can be ENPSPC-prone, also shrink the amount of
metadata to reserve least amount avoid deadly ENOSPC there.
- btrfs_block_rsv_refill() in reserve_metadata_space()
It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
by signal.
- btrfs_block_rsv_refill() in prepare_to_relocate()
- btrfs_block_rsv_add() in prepare_to_relocate()
- btrfs_block_rsv_refill() in relocate_block_group()
- btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
- btrfs_start_transaction() in relocate_block_group()
- btrfs_start_transaction() in create_reloc_inode()
Can be interrupted by fatal signal and we can handle it easily.
For these call sites, just catch the -EINTR value in btrfs_balance()
and count them as canceled.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The whole chunk tree is read at mount time so we can utilize readahead
to get the tree blocks to memory before we read the items. The idea is
from Robbie, but instead of updating search slot readahead, this patch
implements the chunk tree readahead manually from nodes on level 1.
We've decided to do specific readahead optimizations and then unify them
under a common API so we don't break everything by changing the search
slot readahead logic.
Higher chunk trees grow on large filesystems (many terabytes), and
prefetching just level 1 seems to be sufficient. Provided example was
from a 200TiB filesystem with chunk tree level 2.
CC: Robbie Ko <robbieko@synology.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since btrfs_bio always contains the extra space for the tgtdev_map and
raid_maps it's pointless to make the assignment iff specific conditions
are met.
Instead, always assign the pointers to their correct value at allocation
time. To accommodate this change also move code a bit in
__btrfs_map_block so that btrfs_bio::stripes array is always initialized
before the raid_map, subsequently move the call to sort_parity_stripes
in the 'if' building the raid_map, retaining the old behavior.
To better understand the change, there are 2 aspects to this:
1. The original code is harder to grasp because the calculations for
initializing raid_map/tgtdev ponters are apart from the initial
allocation of memory. Having them predicated on 2 separate checks
doesn't help that either... So by moving the initialisation in
alloc_btrfs_bio puts everything together.
2. tgtdev/raid_maps are now always initialized despite sometimes they
might be equal i.e __btrfs_map_block_for_discard calls
alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated
on external checks i.e. just because those pointers are non-null
doesn't mean they are valid per-se. And actually while taking another
look at __btrfs_map_block I saw a discrepancy:
Original code initialised tgtdev_map if the following check is true:
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
However, further down tgtdev_map is only used if the following check
is true:
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
e.g. the additional need_full_stripe(op) predicate is there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more details from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_bio ensures that all submitted bios to devices have valid
btrfs_device::bdev so this check can be removed from btrfs_end_bio. This
check was added in june 2012 597a60fade ("Btrfs: don't count I/O
statistic read errors for missing devices") but then in October of the
same year another commit de1ee92ac3 ("Btrfs: recheck bio against
block device when we map the bio") started checking for the presence of
btrfs_device::bdev before actually issuing the bio.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of recording stripe_index and using that to access correct
btrfs_device from btrfs_bio::stripes record the btrfs_device in
btrfs_io_bio. This will enable endio handlers to increment device
error counters on checksum errors.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:
[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.
Status of the filesystem (100GiB) just after the balance fails:
$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit dccdb07bc9 ("btrfs: kill btrfs_fs_info::volume_mutex") removed
the last use of the volume_mutex, forgetting to update the comment.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.
By having an alien device in fs_devices, we have two issues so far
1. missing device does not not show as missing in the userland
2. degraded mount will fail
Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).
A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.
And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.
This patch fixes both issues above by removing the alien device.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number. For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.
But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].
[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid
Consider a working filesystem with raid1:
$ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
$ mount /dev/sda /mnt-raid1
$ umount /mnt-raid1
While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt-single
$ btrfs dev add -f /dev/sda /mnt-single
Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.
$ mount -o degraded /dev/sdb /mnt-raid1
[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
kernel: BTRFS error (device sdb): failed to read devices
kernel: BTRFS error (device sdb): open_ctree failed
Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need of goto out in open_fs_devices() as there is nothing
special done there.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of returning both the page and the super block structure, make
btrfs_read_disk_super just return a pointer to struct btrfs_disk_super.
As a result the function signature is simplified. Also,
read_cache_page_gfp can never return NULL so check its return value only
for IS_ERR.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new error injection point, should_cancel_balance().
It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
allows us to override the return value.
Currently there are only one locations using this function:
- btrfs_balance()
It checks cancel before each block group.
There are other locations checking fs_info->balance_cancel_req, but they
are not used as an indicator to exit, so there is no need to use the
wrapper.
But there will be more locations coming, and some locations can cause
kernel panic if not handled properly. So introduce this error injection
to provide better test interface.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The validation follows the same steps for all three block group types,
the existing helper validate_convert_profile can be enhanced and do more
of the common things.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having btrfs_alloc_chunk doesn't bring any value since it
encapsulates a lockdep assert and a call to find_next_chunk. Simply
rename the internal __btrfs_alloc_chunk function to the public one
and remove it's 2nd parameter as all callers always pass the return
value of find_next_chunk. Finally, migrate the call to
lockdep_assert_held so as to not lose the check.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we ignore a device whose available space is less than
"BTRFS_STRIPE_LEN * dev_stripes". This is a lower limit for current
allocation policy (to maximize the number of stripes). This commit
parameterizes dev_extent_min, so that other policies can set their own
lower limitat to ignore a device with insufficient space.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out create_chunk() from __btrfs_alloc_chunk(). This function
finally creates a chunk. There is no functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out decide_stripe_size() from __btrfs_alloc_chunk(). This
function calculates the actual stripe size to allocate.
decide_stripe_size() handles the common case to round down the 'ndevs'
to 'devs_increment' and check the upper and lower limitation of 'ndevs'.
decide_stripe_size_regular() decides the size of a stripe and the size
of a chunk. The policy is to maximize the number of stripes.
This commit has no functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out gather_device_info() from __btrfs_alloc_chunk(). This
function iterates over devices list and gather information about
devices. This commit also introduces "max_avail" and
"dev_extent_min" to fold the same calculation to one variable.
This commit has no functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out init_alloc_chunk_ctl() from __btrfs_alloc_chunk(). This
function initialises parameters of "struct alloc_chunk_ctl" for
allocation. init_alloc_chunk_ctl() handles a common part of the
initialisation to load the RAID parameters from btrfs_raid_array.
init_alloc_chunk_ctl_policy_regular() decides some parameters for its
allocation.
The last "else" case in the original code is moved to
__btrfs_alloc_chunk() to handle the error case in the common code.
Replace the BUG_ON with ASSERT() and error return at the same time.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce "struct alloc_chunk_ctl" to wrap needed parameters for the
chunk allocation. This will be used to split __btrfs_alloc_chunk() into
smaller functions.
This commit folds a number of local variables in __btrfs_alloc_chunk()
into one "struct alloc_chunk_ctl ctl". There is no functional change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out two functions from find_free_dev_extent_start().
dev_extent_search_start() decides the starting position of the search.
dev_extent_hole_check() checks if a hole found is suitable for device
extent allocation.
These functions also have the switch-cases to change the allocation
behavior depending on the policy.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk().
Instead return -EINVAL with ASSERT() to catch a bug in the development
stage.
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't use the u_XX types anywhere, though they're defined.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove trivial comprator and open coded swap of two values.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a simple forwarded call based on the operation that would better
fit the caller btrfs_map_block that's until now a trivial wrapper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.
We shouldn't block umount on this, simply bail if we're unmounting the
fs. We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_uuid_tree_iterate is called from only once place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.
Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae ("btrfs: access
superblock via pagecache in scan_one_device").
This patch unwraps the buffer head API and does not change the way the
super block is actually read.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static. Also move it
above it's callers so we don't need a forward declaration.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block device mappings are never in highmem so kmap() / kunmap() calls for
pages from block devices are unneeded. Use page_address() instead of
kmap() to get to the virtual addreses.
While we're at it, read_cache_page_gfp() doesn't return NULL on error,
only an ERR_PTR, so use IS_ERR() to check for errors.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch for removal of buffer_head usage in btrfs.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and
device attributes"), the functions btrfs_sysfs_add_device_link() and
btrfs_sysfs_rm_device_link() do more than just adding and removing the
device link as its name indicated. Rename them to be more specific
that's about the directory with the attirbutes
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current code doesn't correctly handle the situation which arises when
a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
changed to the one in metadata uuid. This causes the incompat flag to
disappear.
In case of a power failure we could end up in a situation where part of
the disks in a multi-disk filesystem are correctly reverted to
METADATA_UUID_INCOMPAT flag unset state, while others have
METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
This patch corrects the behavior required to handle the case where a
disk of the second type is scanned first, creating the necessary
btrfs_fs_devices. Subsequently, when a disk which has already completed
the transition is scanned it should overwrite the data in
btrfs_fs_devices.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.
Fixes: cc5de4e702 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_fsid became rather hairy with the introduction of metadata uuid
changing feature. Alleviate this by factoring out the metadata uuid
specific code in a dedicated function which deals with finding
correct fsid for a device with changed uuid.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since find_fsid_inprogress should also handle the case in which an fs
didn't change its FSID make it call find_fsid directly. This makes the
code in device_list_add simpler by eliminating a conditional call of
find_fsid. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.
For example:
BTRFS info (device sdc): device stats zeroed by btrfs(9223)
Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.
This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.
For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.
On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.
discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701
discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200
While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When closing a device, btrfs_close_one_device() first allocates a new
device, copies the device to close's name, replaces it in the dev_list
with the copy and then finally frees it.
This involves two memory allocation, which can potentially fail. As this
code path is tricky to unwind, the allocation failures where handled by
BUG_ON()s.
But this copying isn't strictly needed, all that is needed is resetting
the device in question to it's state it had after the allocation.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_close_one_device we're decrementing the number of open devices
before we're calling btrfs_close_bdev().
As there is no intermediate exit between these points in this function it
is technically OK to do so, but it makes the code a bit harder to understand.
Move both operations closer together and move the decrement step after
btrfs_close_bdev().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a user report, that cppcheck is complaining about a possible
NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev().
We're first dereferencing the 'tgtdev' variable and the later check for
the validity of the pointer with a WARN_ON(!tgtdev);
But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly
check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the
WARN_ON() is impossible to hit. Just remove it to silence the checker's
complains.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
The value 0 for devs_max means to spread the allocated chunks over all
available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
4 copies respectively.
Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.
The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.
The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old code goes:
offset = logical - em->start;
length = min_t(u64, em->len - offset, length);
Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:
offset = logical - em->start;
length = min_t(u64, em->start + em->len - logical, length);
Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.
Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key. When the item is searched or written, a
local variable for key is used.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.
The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and
then process it we could just bail out early if the read disk key wasn't
a BTRFS_CHUNK_ITEM_KEY.
This removes a level of indentation and makes the code nicer to read.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>