OpenCloudOS-Kernel

Filipe Manana 626e9f41f7 btrfs: fix race leading to unpersisted data and metadata on fsync

When doing a fast fsync on a file, there is a race which can result in the
fsync returning success to user space without logging the inode and without
durably persisting new data.

The following example shows one possible scenario for this:

   $ mkfs.btrfs -f /dev/sdc
   $ mount /dev/sdc /mnt

   $ touch /mnt/bar
   $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz

   # Now we have:
   # file bar == inode 257
   # file baz == inode 258

   $ mv /mnt/baz /mnt/foo

   # Now we have:
   # file bar == inode 257
   # file foo == inode 258

   $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo

   # fsync bar before foo, it is important to trigger the race.
   $ xfs_io -c "fsync" /mnt/bar
   $ xfs_io -c "fsync" /mnt/foo

   # After this:
   # inode 257, file bar, is empty
   # inode 258, file foo, has 1M filled with 0xcd

   <power failure>

   # Replay the log:
   $ mount /dev/sdc /mnt

   # After this point file foo should have 1M filled with 0xcd and not 0xab

The following steps explain how the race happens:

1) Before the first fsync of inode 258, when it has the "baz" name, its
   ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
   The inode also has the full sync flag set;

2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
   the generation of the current transaction, and set ->last_log_commit
   to 0, which is the current value of ->last_sub_trans (done at
   btrfs_log_inode()).

   The full sync flag is cleared from the inode during the fsync.

   The log sub transaction that was committed had an ID of 0 and when we
   synced the log, at btrfs_sync_log(), we incremented root->log_transid
   from 0 to 1;

3) During the rename:

   We update inode 258, through btrfs_update_inode(), and that causes its
   ->last_sub_trans to be set to 1 (the current log transaction ID), and
   ->last_log_commit remains with a value of 0.

   After updating inode 258, because we have previously logged the inode
   in the previous fsync, we log again the inode through the call to
   btrfs_log_new_name(). This results in updating the inode's
   ->last_log_commit from 0 to 1 (the current value of its
   ->last_sub_trans).

   The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
   the next log transaction;

4) Then a buffered write against inode 258 is made. This leaves the value
   of ->last_sub_trans as 1 (the ID of the current log transaction, stored
   at root->log_transid);

5) Then an fsync against inode 257 (or any other inode other than 258),
   happens. This results in committing the log transaction with ID 1,
   which results in updating root->last_log_commit to 1 and bumping
   root->log_transid from 1 to 2;

6) Then an fsync against inode 258 starts. We flush delalloc and wait only
   for writeback to complete, since the full sync flag is not set in the
   inode's runtime flags - we do not wait for ordered extents to complete.

   Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
   ordered extent completes. The call returns true:

     static inline bool btrfs_inode_in_log(...)
     {
         bool ret = false;

         spin_lock(&inode->lock);
         if (inode->logged_trans == generation &&
             inode->last_sub_trans <= inode->last_log_commit &&
             inode->last_sub_trans <= inode->root->last_log_commit)
                 ret = true;
         spin_unlock(&inode->lock);
         return ret;
     }

   generation has a value of 6 (fs_info->generation), ->logged_trans also
   has a value of 6 (set when we logged the inode during the first fsync
   and when logging it during the rename), ->last_sub_trans has a value
   of 1, set during the rename (step 3), ->last_log_commit also has a
   value of 1 (set in step 3) and root->last_log_commit has a value of 1,
   which was set in step 5 when fsyncing inode 257.

   As a consequence we don't log the inode, any new extents and do not
   sync the log, resulting in a data loss if a power failure happens
   after the fsync and before the current transaction commits.
   Also, because we do not log the inode, after a power failure the mtime
   and ctime of the inode do not match those we had before.

   When the ordered extent completes before we call btrfs_inode_in_log(),
   then the call returns false and we log the inode and sync the log,
   since at the end of ordered extent completion we update the inode and
   set ->last_sub_trans to 2 (the value of root->log_transid) and
   ->last_log_commit to 1.

This problem is found after removing the check for the emptiness of the
inode's list of modified extents in the recent commit `209ecbb858`
("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
added in the 5.13 merge window. However checking the emptiness of the
list is not really the way to solve this problem, and was never intended
to, because while that solves the problem for COW writes, the problem
persists for NOCOW writes because in that case the list is always empty.

In the case of NOCOW writes, even though we wait for the writeback to
complete before returning from btrfs_sync_file(), we end up not logging
the inode, which has a new mtime/ctime, and because we don't sync the log,
we never issue disk barriers (send REQ_PREFLUSH to the device) since that
only happens when we sync the log (when we write super blocks at
btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
btrfs_sync_file() to user space, we are not guaranteeing that the data is
durably persisted on disk.

Also, while the example above uses a rename exchange to show how the
problem happens, it is not the only way to trigger it. An alternative
could be adding a new hard link to inode 258, since that also results
in calling btrfs_log_new_name() and updating the inode in the log.
An example reproducer using the addition of a hard link instead of a
rename operation:

   $ mkfs.btrfs -f /dev/sdc
   $ mount /dev/sdc /mnt

   $ touch /mnt/bar
   $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo

   $ ln /mnt/foo /mnt/foo_link

   $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo

   $ xfs_io -c "fsync" /mnt/bar
   $ xfs_io -c "fsync" /mnt/foo

   <power failure>

   # Replay the log:
   $ mount /dev/sdc /mnt

   # After this point file foo often has 1M filled with 0xab and not 0xcd

The reasons leading to the final fsync of file foo, inode 258, not
persisting the new data are the same as for the previous example with
a rename operation.

So fix by never skipping logging and log syncing when there are still any
ordered extents in flight. To avoid making the conditional if statement
that checks if logging an inode is needed harder to read, place all the
logic into an helper function with separate if statements to make it more
manageable and easier to read.

A test case for fstests will follow soon.

For NOCOW writes, the problem existed before commit `b5e6c3e170`
("btrfs: always wait on ordered extents at fsync time"), introduced in
kernel 4.19, then it went away with that commit since we started to always
wait for ordered extent completion before logging.

The problem came back again once the fast fsync path was changed again to
avoid waiting for ordered extent completion, in commit `487781796d`
("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.

However, for COW writes, the race only happens after the recent
commit `209ecbb858` ("btrfs: remove stale comment and logic from
btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
writes, the bug existed before that commit. So tag 5.10+ as the release
for stable backports.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2021-04-28 20:09:45 +02:00
9p	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-27 08:07:12 -08:00
adfs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
affs	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
afs	afs: Use wait_on_page_writeback_killable	2021-03-23 20:54:37 +00:00
autofs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
befs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
bfs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
btrfs	btrfs: fix race leading to unpersisted data and metadata on fsync	2021-04-28 20:09:45 +02:00
cachefiles	cachefiles, afs: mm wait fixes	2021-03-24 10:22:00 -07:00
ceph	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
cifs	cifs: escape spaces in share names	2021-04-07 21:30:27 -05:00
coda	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
configfs	configfs: fix a use-after-free in __configfs_open_file	2021-03-11 12:13:48 +01:00
cramfs	cramfs: use %pD instead of messing with file_dentry()->d_name	2021-01-05 23:02:47 -05:00
crypto	block: rename BIO_MAX_PAGES to BIO_MAX_VECS	2021-03-11 07:47:48 -07:00
debugfs	Driver core / debugfs update for 5.12-rc1	2021-02-24 10:13:55 -08:00
devpts	…
dlm	fs: dlm: check on existing node address	2020-11-10 12:14:20 -06:00
ecryptfs	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
efivarfs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
efs	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
erofs	Change since last update:	2021-03-13 12:26:22 -08:00
exfat	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
exportfs	exportfs: Add a function to return the raw output from fh_to_dentry()	2020-12-09 09:39:38 -05:00
ext2	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
ext4	Miscellaneous ext4 bug fixes for v5.12.	2021-03-21 14:06:10 -07:00
f2fs	block: rename BIO_MAX_PAGES to BIO_MAX_VECS	2021-03-11 07:47:48 -07:00
fat	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
freevxfs	…
fscache	…
fuse	fuse: 32-bit user space ioctl compat for fuse device	2021-03-16 15:20:16 +01:00
gfs2	Two more gfs2 fixes	2021-04-03 12:15:01 -07:00
hfs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
hfsplus	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
hostfs	hostfs: fix memory handling in follow_link()	2021-03-25 18:57:42 -04:00
hpfs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
hugetlbfs	hugetlbfs: remove unneeded return value of hugetlb_vmtruncate()	2021-02-24 13:38:35 -08:00
iomap	Merge branch 'iomap-5.12-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	2021-03-18 10:37:30 -07:00
isofs	isofs: handle large user and group ID	2021-02-03 19:05:52 +01:00
jbd2	block: use an on-stack bio in blkdev_issue_flush	2021-01-27 09:51:48 -07:00
jffs2	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
jfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-27 08:07:12 -08:00
kernfs	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
lockd	SUNRPC: Make trace_svc_process() display the RPC procedure symbolically	2021-01-25 09:36:23 -05:00
minix	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
nfs	nfs: we don't support removing system.nfs4_acl	2021-03-11 13:17:42 -05:00
nfs_common	NFSv4_2: SSC helper should use its own config.	2021-01-28 10:55:37 -05:00
nfsd	NFSD: fix error handling in NFSv4.0 callbacks	2021-03-11 10:58:49 -05:00
nilfs2	block: rename BIO_MAX_PAGES to BIO_MAX_VECS	2021-03-11 07:47:48 -07:00
nls	…
notify	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
ntfs	ntfs: check for valid standard information attribute	2021-02-24 13:38:26 -08:00
ocfs2	ocfs2: fix deadlock between setattr and dio_end_io_write	2021-04-09 14:54:23 -07:00
omfs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
openpromfs	…
orangefs	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
overlayfs	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
proc	mm: use is_cow_mapping() across tree where proper	2021-03-13 11:27:30 -08:00
pstore	pstore fixes for v5.12-rc2	2021-03-05 17:21:25 -08:00
qnx4	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
qnx6	[PATCH] reduce boilerplate in fsid handling	2020-09-18 16:45:50 -04:00
quota	quota: Fix memory leak when handling corrupted quota file	2021-01-05 14:42:18 +01:00
ramfs	ramfs: support O_TMPFILE	2021-02-24 13:38:26 -08:00
reiserfs	reiserfs: update reiserfs_xattrs_initialized() condition	2021-03-30 14:27:32 -07:00
romfs	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2020-10-24 12:26:05 -07:00
squashfs	squashfs: fix xattr id and id lookup sanity checks	2021-03-25 09:22:55 -07:00
sysfs	sysfs: Support zapping of binary attr mmaps	2021-01-12 14:26:31 +01:00
sysv	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
tracefs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
ubifs	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
udf	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
ufs	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
unicode	unicode: Add utf8_casefold_hash	2020-09-10 14:03:31 -07:00
vboxsf	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
verity	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
xfs	xfs: also reject BULKSTAT_SINGLE in a mount user namespace	2021-03-15 08:50:41 -07:00
zonefs	zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()	2021-03-17 08:56:50 +09:00
Kconfig	s390,alpha: make TMPFS_INODE64 available again	2021-03-08 10:46:30 +01:00
Kconfig.binfmt	Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-21 09:29:23 -08:00
Makefile	fs: Remove dcookies support	2021-01-29 10:06:46 +05:30
aio.c	Merge branch 'akpm' (patches from Andrew)	2020-12-15 12:53:37 -08:00
anon_inodes.c	fs: anon_inodes: rephrase to appropriate kernel-doc	2021-01-15 12:17:25 -05:00
attr.c	ima: handle idmapped mounts	2021-01-24 14:27:20 +01:00
bad_inode.c	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
binfmt_aout.c	…
binfmt_elf.c	Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux	2021-02-21 13:20:41 -08:00
binfmt_elf_fdpic.c	Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux	2021-02-21 13:20:41 -08:00
binfmt_em86.c	…
binfmt_flat.c	binfmt_flat: revert "binfmt_flat: don't offset the data start"	2020-08-24 08:49:13 +10:00
binfmt_misc.c	binfmt_misc: fix possible deadlock in bm_register_write	2021-03-13 11:27:30 -08:00
binfmt_script.c	…
block_dev.c	block: don't ignore REQ_NOWAIT for direct IO	2021-04-02 08:34:30 -06:00
buffer.c	fs: buffer: use raw page_memcg() on locked page	2021-02-24 13:38:30 -08:00
char_dev.c	…
compat_binfmt_elf.c	get rid of COMPAT_ELF_EXEC_PAGESIZE	2021-01-06 08:42:51 -05:00
coredump.c	fs/coredump: use kmap_local_page()	2021-02-26 09:41:05 -08:00
d_path.c	fs: fix NULL dereference due to data race in prepend_path()	2020-10-14 14:54:45 -07:00
dax.c	mm: provide a saner PTE walking API for modules	2021-02-09 07:05:44 -05:00
dcache.c	fs: delete repeated words in comments	2021-02-24 13:38:26 -08:00
direct-io.c	fs: direct-io: fix missing sdio->boundary	2021-04-09 14:54:23 -07:00
drop_caches.c	…
eventfd.c	eventfd: Export eventfd_ctx_do_read()	2020-11-15 09:49:10 -05:00
eventpoll.c	kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE	2021-02-16 09:59:41 +01:00
exec.c	fs: delete repeated words in comments	2021-02-24 13:38:26 -08:00
fcntl.c	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
fhandle.c	fs: delete repeated words in comments	2021-02-24 13:38:26 -08:00
file.c	file: fix close_range() for unshare+cloexec	2021-04-02 14:11:10 +02:00
file_table.c	epoll: take epitem list out of struct file	2020-10-25 20:02:08 -04:00
filesystems.c	…
fs-writeback.c	fs: improve comments for writeback_single_inode()	2021-01-13 17:26:50 +01:00
fs_context.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
fs_parser.c	fs_parse: mark fs_param_bad_value() as static	2020-10-13 18:38:27 -07:00
fs_pin.c	…
fs_struct.c	vfs: Use sequence counter with associated spinlock	2020-07-29 16:14:27 +02:00
fs_types.c	…
fsopen.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
init.c	init: handle idmapped mounts	2021-01-24 14:27:19 +01:00
inode.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-27 08:07:12 -08:00
internal.h	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
io-wq.c	io-wq: cancel unbounded works on io-wq destroy	2021-04-08 13:33:17 -06:00
io-wq.h	io_uring: remove structures from include/linux/io_uring.h	2021-03-18 09:44:35 -06:00
io_uring.c	io_uring: fix early sqd_list removal sqpoll hangs	2021-04-14 13:07:27 -06:00
ioctl.c	fs: remove ksys_ioctl	2020-07-31 08:16:01 +02:00
kernel_read_file.c	fs/kernel_file_read: Add "offset" arg for partial reads	2020-10-05 13:37:04 +02:00
libfs.c	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
locks.c	Revert "nfsd4: a client's own opens needn't prevent delegations"	2021-03-09 10:37:34 -05:00
mbcache.c	…
mount.h	mount: make {lock,unlock}_mount_hash() static	2021-01-24 14:29:34 +01:00
mpage.c	block: rename BIO_MAX_PAGES to BIO_MAX_VECS	2021-03-11 07:47:48 -07:00
namei.c	LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late	2021-04-06 20:33:00 -04:00
namespace.c	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-02-27 08:07:12 -08:00
no-block.c	…
nsfs.c	…
open.c	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
pipe.c	fs: delete repeated words in comments	2021-02-24 13:38:26 -08:00
pnode.c	…
pnode.h	mount: fix mounting of detached mounts onto targets that reside on shared mounts	2021-03-08 15:18:43 +01:00
posix_acl.c	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
proc_namespace.c	fs: introduce MOUNT_ATTR_IDMAP	2021-01-24 14:43:45 +01:00
read_write.c	teach sendfile(2) to handle send-to-pipe directly	2021-01-25 23:29:36 -05:00
readdir.c	readdir: make sure to verify directory entry for legacy interfaces too	2021-04-17 11:39:49 -07:00
remap_range.c	ioctl: handle idmapped mounts	2021-01-24 14:27:19 +01:00
select.c	kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()	2021-03-16 22:13:10 +01:00
seq_file.c	fs: fix kernel-doc markups	2021-01-21 14:06:00 -07:00
signalfd.c	treewide: Use fallthrough pseudo-keyword	2020-08-23 17:36:59 -05:00
splice.c	for-5.12/block-2021-02-17	2021-02-21 11:02:48 -08:00
stack.c	…
stat.c	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
statfs.c	s390,alpha: switch to 64-bit ino_t	2021-02-13 17:17:53 +01:00
super.c	It has been a relatively quiet cycle in docsland.	2021-02-22 10:57:46 -08:00
sync.c	…
timerfd.c	…
userfaultfd.c	userfaultfd: use secure anon inodes for userfaultfd	2021-01-14 17:40:57 -05:00
utimes.c	utimes: handle idmapped mounts	2021-01-24 14:27:18 +01:00
xattr.c	namei: handle idmapped mounts in may_*() helpers	2021-01-24 14:27:17 +01:00