OpenCloudOS-Kernel/fs
Filipe Manana 8eaf40c0e2 Btrfs: fix race between block group removal and block group allocation
If a task is removing the block group that currently has the highest start
offset amongst all existing block groups, there is a short time window
where it races with a concurrent block group allocation, resulting in a
transaction abort with an error code of EEXIST.

The following diagram explains the race in detail:

      Task A                                                        Task B

 btrfs_remove_block_group(bg offset X)

   remove_extent_mapping(em offset X)
     -> removes extent map X from the
        tree of extent maps
        (fs_info->mapping_tree), so the
        next call to find_next_chunk()
        will return offset X

                                                   btrfs_alloc_chunk()
                                                     find_next_chunk()
                                                       --> returns offset X

                                                     __btrfs_alloc_chunk(offset X)
                                                       btrfs_make_block_group()
                                                         btrfs_create_block_group_cache()
                                                           --> creates btrfs_block_group_cache
                                                               object with a key corresponding
                                                               to the block group item in the
                                                               extent, the key is:
                                                               (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)

                                                         --> adds the btrfs_block_group_cache object
                                                             to the list new_bgs of the transaction
                                                             handle

                                                   btrfs_end_transaction(trans handle)
                                                     __btrfs_end_transaction()
                                                       btrfs_create_pending_block_groups()
                                                         --> sees the new btrfs_block_group_cache
                                                             in the new_bgs list of the transaction
                                                             handle
                                                         --> its call to btrfs_insert_item() fails
                                                             with -EEXIST when attempting to insert
                                                             the block group item key
                                                             (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                             because task A has not removed that key yet
                                                         --> aborts the running transaction with
                                                             error -EEXIST

   btrfs_del_item()
     -> removes the block group's key from
        the extent tree, key is
        (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)

A sample transaction abort trace:

  [78912.403537] ------------[ cut here ]------------
  [78912.403811] BTRFS: Transaction aborted (error -17)
  [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
  (...)
  [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G        W         5.0.0-btrfs-next-46 #1
  [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
  [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
  (...)
  [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
  [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
  [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
  [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
  [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
  [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
  [78912.409899] FS:  00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
  [78912.410285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
  [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [78912.411898] Call Trace:
  [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
  [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
  [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
  [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
  [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
  [78912.414535]  ? __sb_start_write+0xd4/0x1c0
  [78912.414963]  ? mnt_want_write_file+0x24/0x50
  [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
  [78912.415832]  ? lock_acquire+0xa6/0x190
  [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
  [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
  [78912.417534]  ? __fget+0x113/0x200
  [78912.417954]  ksys_ioctl+0x70/0x80
  [78912.418369]  __x64_sys_ioctl+0x16/0x20
  [78912.418812]  do_syscall_64+0x60/0x1b0
  [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [78912.419644] RIP: 0033:0x7f3880252dd7
  (...)
  [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
  [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
  [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
  [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
  [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
  [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---

Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
only at the very end, after removing the block group item key from the
extent tree (and removing the free space tree entry if we are using the
free space tree feature).

Fixes: 04216820fe ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-06-12 15:55:42 +02:00
..
9p Pull request for inlusion in 5.1 2019-03-17 09:10:56 -07:00
adfs adfs: use timespec64 for time conversion 2018-08-22 10:52:51 -07:00
affs
afs afs: Fix in-progess ops to ignore server-level callback invalidation 2019-04-13 08:37:37 +01:00
autofs autofs: clear O_NONBLOCK on the pipe 2019-03-07 18:32:01 -08:00
befs
bfs bfs: extra sanity checking and static inode bitmap 2019-01-04 13:13:47 -08:00
btrfs Btrfs: fix race between block group removal and block group allocation 2019-06-12 15:55:42 +02:00
cachefiles fscache, cachefiles: remove redundant variable 'cache' 2018-11-30 16:00:58 +00:00
ceph ceph: fix ci->i_head_snapc leak 2019-04-23 21:37:54 +02:00
cifs cifs: fix page reference leak with readv/writev 2019-04-24 12:33:59 -05:00
coda
configfs configfs: fix registered group removal 2018-07-17 06:14:07 -07:00
cramfs Make the Cramfs code more robust against filesystem corruptions, 2018-10-30 12:46:25 -07:00
crypto fscrypt updates for v5.1 2019-03-09 10:54:24 -08:00
debugfs debugfs: fix use-after-free on symlink traversal 2019-04-01 00:31:02 -04:00
devpts fs/devpts: always delete dcache dentry-s in dput() 2019-01-24 13:38:30 -05:00
dlm socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes 2019-02-03 11:17:31 -08:00
ecryptfs crypto: clarify name of WEAK_KEY request flag 2019-01-25 18:41:52 +08:00
efivarfs efivars: Call guid_parse() against guid_t type of variable 2018-07-22 14:13:44 +02:00
efs
exportfs exportfs: do not read dentry after free 2018-11-23 09:08:17 -05:00
ext2 \n 2019-03-07 09:01:33 -08:00
ext4 Miscellaneous ext4 bug fixes for 5.1. 2019-03-24 13:41:37 -07:00
f2fs f2fs-for-5.1-rc1 2019-03-15 13:42:53 -07:00
fat fat: enable .splice_write to support splice on O_DIRECT file 2019-03-07 18:32:01 -08:00
freevxfs
fscache fscache: fix race between enablement and dropping of object 2018-11-30 15:57:31 +00:00
fuse Merge branch 'page-refs' (page ref overflow) 2019-04-14 15:09:40 -07:00
gfs2 We've only got three patches ready for this merge window: 2019-03-09 11:52:11 -08:00
hfs hfs: do not free node before using 2018-11-30 14:56:14 -08:00
hfsplus hfsplus: return file attributes on statx 2019-01-04 13:13:47 -08:00
hostfs vfs: discard ATTR_ATTR_FLAG 2018-08-17 16:20:28 -07:00
hpfs hpfs: fix spelling mistake "partion" -> "partition" 2019-03-12 09:58:03 -07:00
hugetlbfs hugetlbfs: fix memory leak for resv_map 2019-04-05 16:02:31 -10:00
isofs Update email address 2018-09-29 22:47:48 -04:00
jbd2 jbd2: jbd2_get_transaction does not need to return a value 2019-03-01 00:36:57 -05:00
jffs2 jffs2: fix use-after-free on symlink traversal 2019-04-01 00:31:02 -04:00
jfs jfs: remove redundant dquot_initialize() in jfs_evict_inode() 2018-09-20 09:28:49 -05:00
kernfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
lockd NFS: fix mount/umount race in nlmclnt. 2019-03-18 22:39:34 -04:00
minix
nfs NFSv4.1 fix incorrect return value in copy_file_range 2019-04-11 15:23:48 -04:00
nfs_common
nfsd nfsd: wake blocked file lock waiters before sending callback 2019-04-22 15:38:41 -04:00
nilfs2 XArray: Change xa_insert to return -EBUSY 2019-02-06 13:12:15 -05:00
nls
notify fanotify: Allow copying of file handle to userspace 2019-03-19 09:29:07 +01:00
ntfs mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
ocfs2 ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock 2019-03-29 10:01:37 -07:00
omfs
openpromfs fs/openpromfs: Use of_node_name_eq for node name comparisons 2018-11-18 13:35:19 -08:00
orangefs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 13:27:20 -07:00
overlayfs ovl: Do not lose security.capability xattr over metadata file copy-up 2019-02-13 11:14:46 +01:00
proc fs/proc/proc_sysctl.c: Fix a NULL pointer dereference 2019-04-26 09:18:05 -07:00
pstore pstore/ram: Avoid needless alloc during header write 2019-02-12 13:45:53 -08:00
qnx4
qnx6
quota quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. 2018-12-18 18:29:15 +01:00
ramfs
reiserfs reiserfs: remove workaround code for GCC 3.x 2018-10-31 08:54:14 -07:00
romfs
squashfs Squashfs: Compute expected length from inode size rather than block length 2018-08-02 09:34:02 -07:00
sysfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-16 10:31:02 -07:00
sysv sysv: return 'err' instead of 0 in __sysv_write_inode 2018-11-10 08:02:40 -05:00
tracefs tracefs: Annotate tracefs_ops with __ro_after_init 2018-07-31 11:32:44 -04:00
ubifs ubifs: fix use-after-free on symlink traversal 2019-04-01 00:31:02 -04:00
udf udf: Propagate errors from udf_truncate_extents() 2019-03-18 16:30:02 +01:00
ufs fs/ufs: use ktime_get_real_seconds for sb and cg timestamps 2018-08-17 16:20:27 -07:00
xfs xfs: serialize unaligned dio writes against all other dio writes 2019-03-26 08:37:55 -07:00
Kconfig Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
Kconfig.binfmt kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
Makefile Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
aio.c aio: use kmem_cache_free() instead of kfree() 2019-04-04 20:13:59 -04:00
anon_inodes.c anon_inode_getfile(): switch to alloc_file_pseudo() 2018-07-12 10:04:27 -04:00
attr.c
bad_inode.c get rid of 'opened' argument of ->atomic_open() - part 3 2018-07-12 10:04:20 -04:00
binfmt_aout.c a.out: remove core dumping support 2019-03-05 10:00:35 -08:00
binfmt_elf.c fs/binfmt_elf.c: spread const a little 2019-03-07 18:32:01 -08:00
binfmt_elf_fdpic.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c exec: load_script: Do not exec truncated interpreter path 2019-02-18 16:49:36 -08:00
block_dev.c block: fix the return errno for direct IO 2019-04-11 21:22:21 -06:00
buffer.c fs: fix guard_bio_eod to check for real EOD errors 2019-02-28 13:59:41 -07:00
char_dev.c
compat.c
compat_binfmt_elf.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
compat_ioctl.c media updates for v4.20-rc1 2018-10-29 14:29:58 -07:00
coredump.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
d_path.c
dax.c fs/dax: Deposit pagetable even when installing zero page 2019-03-13 13:58:46 -07:00
dcache.c fs/dcache: Track & report number of negative dentries 2019-01-30 11:02:11 -08:00
dcookies.c
direct-io.c block: allow bio_for_each_segment_all() to iterate over multi-page bvec 2019-02-15 08:40:11 -07:00
drop_caches.c fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() 2019-02-01 15:46:24 -08:00
eventfd.c
eventpoll.c epoll: use rwlock in order to reduce ep_poll_callback() contention 2019-03-07 18:32:01 -08:00
exec.c exec: increase BINPRM_BUF_SIZE to 256 2019-03-07 18:32:01 -08:00
fcntl.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
fhandle.c
file.c io_uring-2019-03-06 2019-03-08 14:48:40 -08:00
file_table.c fs: add fget_many() and fput_many() 2019-02-28 08:24:23 -07:00
filesystems.c vfs: Implement a filesystem superblock creation/configuration context 2019-02-28 03:29:26 -05:00
fs-writeback.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-01-22 14:39:38 -07:00
fs_context.c vfs: Implement logging through fs_context 2019-02-28 03:29:37 -05:00
fs_parser.c fs: fs_parser: fix printk format warning 2019-03-29 10:01:38 -07:00
fs_pin.c
fs_struct.c
fs_types.c fs: common implementation of file type 2019-01-21 17:48:13 +01:00
inode.c fs/inode.c: inode_set_flags(): replace opencoded set_mask_bits() 2019-03-05 21:07:13 -08:00
internal.h vfs: Add configuration parser helpers 2019-02-28 03:28:53 -05:00
io_uring.c io_uring: remove 'state' argument from io_{read,write} path 2019-04-23 08:17:58 -06:00
ioctl.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
iomap.c block: add BIO_NO_PAGE_REF flag 2019-03-18 10:44:48 -06:00
libfs.c
locks.c locks: wake any locks blocked on request before deadlock check 2019-03-25 08:36:24 -04:00
mbcache.c
mount.h saner handling of temporary namespaces 2019-01-30 17:44:07 -05:00
mpage.c block: allow bio_for_each_segment_all() to iterate over multi-page bvec 2019-02-15 08:40:11 -07:00
namei.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
namespace.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
no-block.c
nsfs.c
open.c fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock 2019-04-06 07:01:55 -10:00
pipe.c Merge branch 'page-refs' (page ref overflow) 2019-04-14 15:09:40 -07:00
pnode.c separate copying and locking mount tree on cross-userns copies 2019-01-30 17:14:50 -05:00
pnode.h separate copying and locking mount tree on cross-userns copies 2019-01-30 17:14:50 -05:00
posix_acl.c
proc_namespace.c
read_write.c fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock 2019-04-06 07:01:55 -10:00
readdir.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
select.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
seq_file.c fs/seq_file.c: simplify seq_file iteration code and interface 2018-08-17 16:20:28 -07:00
signalfd.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
splice.c There tracing fixes: 2019-04-26 11:09:55 -07:00
stack.c
stat.c fs: move generic stat response attr handling to vfs_getattr_nosec 2019-02-01 01:55:45 -05:00
statfs.c vfs: add vfs_get_fsid() helper 2019-02-07 16:38:35 +01:00
super.c vfs: Add some logging to the core users of the fs_context log 2019-02-28 03:29:38 -05:00
sync.c
timerfd.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
userfaultfd.c coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping 2019-04-19 09:46:05 -07:00
utimes.c y2038: syscalls: rename y2038 compat syscalls 2019-02-07 00:13:27 +01:00
xattr.c sysfs: Do not return POSIX ACL xattrs via listxattr 2018-09-18 07:30:48 -04:00