linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Ronnie Sahlberg	af08f9e79c	cifs: create a helper function to parse the query-directory response buffer Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-01-26 19:24:16 -06:00
Ronnie Sahlberg	0a17799cc0	cifs: prepare SMB2_query_directory to be used with compounding Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-01-26 19:24:16 -06:00
zhengbin	01d1bd76a1	fs/cifs/cifssmb.c: use true,false for bool variable Fixes coccicheck warning: fs/cifs/cifssmb.c:4622:3-22: WARNING: Assignment of 0/1 to bool variable fs/cifs/cifssmb.c:4756:3-22: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-01-26 19:24:16 -06:00
zhengbin	720aec0126	fs/cifs/smb2ops.c: use true,false for bool variable Fixes coccicheck warning: fs/cifs/smb2ops.c:807:2-36: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-01-26 19:24:16 -06:00
Linus Torvalds	5cf9ad0e6b	io_uring-5.5-2020-01-26 Merge tag 'io_uring-5.5-2020-01-26' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Fix for two regressions in this cycle, both reported by the postgresql use case. One removes the added restriction on who can submit IO, making it possible for rings shared across forks to do so. The other fixes an issue for the same kind of use case, where one exiting process would cancel all IO" * tag 'io_uring-5.5-2020-01-26' of git://git.kernel.dk/linux-block: io_uring: don't cancel all work on process exit Revert "io_uring: only allow submit from owning task"	2020-01-26 12:23:04 -08:00
Linus Torvalds	b1b298914f	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "Fix a use-after-free in do_last() handling of sysctl_protected_... checks. The use-after-free normally doesn't happen there, but race with rename() and it becomes possible" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_last(): fetch directory ->i_mode and ->i_uid before it's too late	2020-01-26 10:33:48 -08:00
Jens Axboe	ebe1002621	io_uring: don't cancel all work on process exit If we're sharing the ring across forks, then one process exiting means that we cancel ALL work and prevent future work. This is overly restrictive. As long as we cancel the work associated with the files from the current task, it's safe to let others persist. Normal fd close on exit will still wait (and cancel) pending work. Fixes: `fcb323cc53` ("io_uring: io_uring: add support for async work inheriting files") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-26 10:17:12 -07:00
Jens Axboe	73e08e711d	Revert "io_uring: only allow submit from owning task" This ends up being too restrictive for tasks that willingly fork and share the ring between forks. Andres reports that this breaks his postgresql work. Since we're close to 5.5 release, revert this change for now. Cc: stable@vger.kernel.org Fixes: `44d282796f` ("io_uring: only allow submit from owning task") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-26 09:56:05 -07:00
David Howells	a45ea48e2b	afs: Fix characters allowed into cell names The afs filesystem needs to prohibit certain characters from cell names, such as '/', as these are used to form filenames in procfs, leading to the following warning being generated: WARNING: CPU: 0 PID: 3489 at fs/proc/generic.c:178 Fix afs_alloc_cell() to disallow nonprintable characters, '/', '@' and names that begin with a dot. Remove the check for "@cell" as that is then redundant. This can be tested by running: echo add foo/.bar 1.2.3.4 >/proc/fs/afs/cells Note that we will also need to deal with: - Names ending in ".invalid" shouldn't be passed to the DNS. - Names that contain non-valid domainname chars shouldn't be passed to the DNS. - DNS replies that say "your-dns-needs-immediate-attention.<gTLD>" and replies containing A records that say 127.0.53.53 should be considered invalid. [https://www.icann.org/en/system/files/files/name-collision-mitigation-01aug14-en.pdf] but these need to be dealt with by the kafs-client DNS program rather than the kernel. Reported-by: syzbot+b904ba7c947a37b4b291@syzkaller.appspotmail.com Cc: stable@kernel.org Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-26 08:54:04 -08:00
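The cell-name rules listed in the afs commit above can be sketched in user space as follows. This is only an illustration: `cell_name_is_valid()` is a hypothetical helper, not the kernel's `afs_alloc_cell()`, and rejecting an empty name is an assumption that the commit message does not state.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * User-space sketch only: cell_name_is_valid() is a hypothetical helper,
 * not the kernel's afs_alloc_cell(). It applies the rules listed in the
 * commit message: no leading dot, no '/', no '@', no non-printable bytes.
 */
static bool cell_name_is_valid(const char *name)
{
	size_t i;

	/* Assumption: an empty name is rejected as well. */
	if (name == NULL || name[0] == '\0')
		return false;

	/* Names that begin with a dot are refused. */
	if (name[0] == '.')
		return false;

	for (i = 0; name[i] != '\0'; i++) {
		unsigned char ch = (unsigned char)name[i];

		if (!isprint(ch) || ch == '/' || ch == '@')
			return false;
	}
	return true;
}
```

With these rules, the test string from the commit message, `foo/.bar`, is rejected because it contains a `/`.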
Al Viro	d0cb50185a	do_last(): fetch directory ->i_mode and ->i_uid before it's too late may_create_in_sticky() call is done when we already have dropped the reference to dir. Fixes: `30aba6656f` (namei: allow restricted O_CREAT of FIFOs and regular files) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-26 09:31:07 -05:00
Linus Torvalds	a075f23dd4	for-5.5-rc8-tag Merge tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "Here's a last minute fix for a regression introduced in this development cycle. There's a small chance of a silent corruption when device replace and NOCOW data writes happen at the same time in one block group. Metadata or COW data writes are unaffected. The extra fixup patch is there to silence an unnecessary warning" * tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: dev-replace: remove warning for unknown return codes when finished btrfs: scrub: Require mandatory block group RO for dev-replace	2020-01-25 10:55:24 -08:00
Dan Carpenter	587065dcac	fs/adfs: bigdir: Fix an error code in adfs_fplus_read() This code accidentally returns success, but it should return the -EIO error code from adfs_fplus_validate_header(). Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Fixes: `d79288b4f6` ("fs/adfs: bigdir: calculate and validate directory checkbyte") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-25 11:31:59 -05:00
David Sterba	4cea9037f8	btrfs: dev-replace: remove warning for unknown return codes when finished The fstests btrfs/011 triggered a warning at the end of device replace, [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28 [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28 [ 1892.059993] ------------[ cut here ]------------ [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs] [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942 [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs] [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286 [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889 [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080 [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000 [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000 [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200 [ 1892.108642] FS: 00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000 [ 1892.111558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0 [ 1892.115814] Call Trace: [ 1892.116896] btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs] [ 1892.118962] btrfs_ioctl+0x1d62/0x2550 [btrfs] caused by the previous patch ("btrfs: scrub: Require mandatory block group RO for dev-replace"). Hitting ENOSPC is possible and could happen when the block group is set read-only, preventing NOCOW writes to the area that's being accessed by dev-replace. This has happened with scratch devices of size 12G but not with 5G and 20G, so this depends on timing and other activity on the filesystem. The whole replace operation is restartable, and the space state should be examined by the user in any case.
The error code is propagated back to the ioctl caller so the kernel warning is causing false alerts. Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-25 12:49:12 +01:00
zhangyi (F)	7f6225e446	jbd2: clean __jbd2_journal_abort_hard() and __journal_abort_soft() __jbd2_journal_abort_hard() is no longer used, so now we can merge __jbd2_journal_abort_hard() and __journal_abort_soft() these two functions into jbd2_journal_abort() and remove them. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 03:01:56 -05:00
zhangyi (F)	0e98c084a2	jbd2: make sure ESHUTDOWN to be recorded in the journal superblock Commit `fb7c02445c` ("ext4: pass -ESHUTDOWN code to jbd2 layer") wants to allow the jbd2 layer to distinguish a shutdown journal abort from other error cases. So ESHUTDOWN should take precedence over any other errno which has already been recorded after EXT4_FLAGS_SHUTDOWN is set, but currently the errno in the journal superblock is only updated if the old errno is 0. Fixes: `fb7c02445c` ("ext4: pass -ESHUTDOWN code to jbd2 layer") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 03:00:20 -05:00
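The precedence rule described in the ESHUTDOWN commit above can be illustrated with a small user-space sketch. `next_recorded_errno()` is a hypothetical helper, not a jbd2 function, and the sketch assumes a platform whose `<errno.h>` defines `ESHUTDOWN` (Linux does). It models only the decision of which errno ends up recorded, not the superblock update itself.

```c
#include <errno.h>

/*
 * Sketch of the precedence rule only; next_recorded_errno() is a
 * hypothetical helper, not a jbd2 function. Values follow the kernel
 * convention of negative errno codes, with 0 meaning "nothing recorded".
 */
static int next_recorded_errno(int recorded, int incoming)
{
	/* The first error wins when nothing has been recorded yet. */
	if (recorded == 0)
		return incoming;

	/* A shutdown abort overrides whatever was recorded before. */
	if (incoming == -ESHUTDOWN)
		return -ESHUTDOWN;

	/* Otherwise keep the error that is already recorded. */
	return recorded;
}
```

Under the old behaviour described in the commit message, only the first branch existed, so a later `-ESHUTDOWN` could not replace an earlier error.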
zhangyi (F)	51f57b01e4	ext4, jbd2: ensure panic when aborting with zero errno The JBD2_REC_ERR flag is used to indicate that the errno has been updated when jbd2 aborted, and then __ext4_abort() and ext4_handle_error() can invoke panic if ERRORS_PANIC is specified. But if the journal has been aborted with zero errno, jbd2_journal_abort() didn't set this flag so we can no longer panic. Fix this by always recording the proper errno in the journal superblock. Fixes: `4327ba52af` ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:59:25 -05:00
zhangyi (F)	d0a186e0d3	jbd2: switch to use jbd2_journal_abort() when failed to submit the commit record We invoke jbd2_journal_abort() to abort the journal and record errno in the jbd2 superblock when committing journal transaction besides the failure on submitting the commit record. But there is no need for the case and we can also invoke jbd2_journal_abort() instead of __jbd2_journal_abort_hard(). Fixes: `818d276ceb` ("ext4: Add the journal checksum feature") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:58:46 -05:00
Vasily Averin	1a8e9cf40c	jbd2_seq_info_next should increase position index If the seq_file .next function does not change the position index, a read after some lseek can generate unexpected output. The script below generates endless output: $ q=;while read -r r;do echo "$((++q)) $r";done </proc/fs/jbd2/DEV/info https://bugzilla.kernel.org/show_bug.cgi?id=206283 Fixes: `1f4aace60b` ("fs/seq_file.c: simplify seq_file iteration code and interface") Cc: stable@kernel.org Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:30:46 -05:00
Shijie Luo	17c51d836c	jbd2: remove pointless assertion in __journal_remove_journal_head Only when jh->b_jcount = 0 in jbd2_journal_put_journal_head, we are allowed to call __journal_remove_journal_head. This assertion is meaningless, just remove it. Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200123070054.50585-1-luoshijie1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:25:56 -05:00
Shijie Luo	8d6ce13679	ext4,jbd2: fix comment and code style Fix comment and remove unnecessary blank. Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200123064325.36358-1-luoshijie1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:24:53 -05:00
wangyan	0c1cba6cca	jbd2: delete the duplicated words in the comments Delete the duplicated words "is" in the comments Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/12087f77-ab4d-c7ba-53b4-893dbf0026f0@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:23:29 -05:00
Dmitry Monakhov	52144d893d	ext4: fix extent_status trace points Show pblock only if it has meaningful value. # before ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 576460752303423487 H ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 576460752303423487 HR # after ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 0 H ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 0 HR Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20191114200147.1073-2-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 02:03:03 -05:00
Chengguang Xu	57c32ea42f	ext4: choose hardlimit when softlimit is larger than hardlimit in ext4_statfs_project() Setting softlimit larger than hardlimit seems meaningless for disk quota but currently it is allowed. In this case, there may be a bit of confusion for users when they run the df command on a directory which has project quota. For example, we set 20M softlimit and 10M hardlimit of block usage limit for project quota of test_dir(project id 123). [root@hades mnt_ext4]# repquota -P -a *** Report for project quotas on device /dev/loop0 Block grace time: 7days; Inode grace time: 7days Block limits File limits Project used soft hard grace used soft hard grace ---------------------------------------------------------------------- 0 -- 13 0 0 2 0 0 123 -- 10237 20480 10240 5 200 100 The result of df command as below: [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 20M 10M 10M 50% /home/cgxu/test/mnt_ext4 Even though it looks like there is another 10M free space to use, if we write new data to directory test_dir(inherit project id), the write will fail with errno(-EDQUOT). After this patch, the df result looks like below. [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 10M 10M 3.0K 100% /home/cgxu/test/mnt_ext4 Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191016022501.760-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-25 01:53:42 -05:00
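The limit selection described in the ext4_statfs_project() commit above can be sketched as follows. `effective_quota_limit()` is a hypothetical helper, and treating a limit of 0 as "no limit set" is an assumption based on usual quota conventions, not something the commit message spells out.

```c
#include <stdint.h>

/*
 * Sketch only: pick the limit that statfs should report for a project.
 * Assumption: a value of 0 means "no limit configured" for that field.
 */
static uint64_t effective_quota_limit(uint64_t soft, uint64_t hard)
{
	if (soft == 0)
		return hard;
	if (hard == 0)
		return soft;

	/* Both limits are set: the smaller one is what writes can reach. */
	return soft < hard ? soft : hard;
}
```

For the example in the commit message (soft 20480, hard 10240, in 1K blocks) this selects 10240, which corresponds to the 10M size that df reports after the patch.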
Eric Biggers	ec772f0130	ext4: fix race conditions in ->d_compare() and ->d_hash() Since ->d_compare() and ->d_hash() can be called in RCU-walk mode, ->d_parent and ->d_inode can be concurrently modified, and in particular, ->d_inode may be changed to NULL. For ext4_d_hash() this resulted in a reproducible NULL dereference if a lookup is done in a directory being deleted, e.g. with: int main() { if (fork()) { for (;;) { mkdir("subdir", 0700); rmdir("subdir"); } } else { for (;;) access("subdir/file", 0); } } ... or by running the 't_encrypted_d_revalidate' program from xfstests. Both repros work in any directory on a filesystem with the encoding feature, even if the directory doesn't actually have the casefold flag. I couldn't reproduce a crash in ext4_d_compare(), but it appears that a similar crash is possible there. Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and falling back to the case sensitive behavior if the inode is NULL. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: `b886ee3e77` ("ext4: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.2+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200124041234.159740-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-24 22:35:03 -05:00
Theodore Ts'o	244adf6426	ext4: make dioread_nolock the default This fixes the direct I/O versus writeback race which can reveal stale data, and it improves the tail latency of commits on slow devices. Link: https://lore.kernel.org/r/20200125022254.1101588-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-24 21:23:12 -05:00
Sebastian Andrzej Siewior	cb923159bb	smp: Remove allocation mask from on_each_cpu_cond.*() The allocation mask is no longer used by on_each_cpu_cond() and on_each_cpu_cond_mask() and can be removed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200117090137.1205765-4-bigeasy@linutronix.de	2020-01-24 20:40:09 +01:00
Eric Biggers	80f2388afa	f2fs: fix race conditions in ->d_compare() and ->d_hash() Since ->d_compare() and ->d_hash() can be called in RCU-walk mode, ->d_parent and ->d_inode can be concurrently modified, and in particular, ->d_inode may be changed to NULL. For f2fs_d_hash() this resulted in a reproducible NULL dereference if a lookup is done in a directory being deleted, e.g. with: int main() { if (fork()) { for (;;) { mkdir("subdir", 0700); rmdir("subdir"); } } else { for (;;) access("subdir/file", 0); } } ... or by running the 't_encrypted_d_revalidate' program from xfstests. Both repros work in any directory on a filesystem with the encoding feature, even if the directory doesn't actually have the casefold flag. I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a similar crash is possible there. Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and falling back to the case sensitive behavior if the inode is NULL. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: `2c2eb7a300` ("f2fs: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.4+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-01-24 10:04:09 -08:00
Eric Biggers	5515eae647	f2fs: fix dcache lookup of !casefolded directories Do the name comparison for non-casefolded directories correctly. This is analogous to ext4's commit `66883da1ee` ("ext4: fix dcache lookup of !casefolded directories"). Fixes: `2c2eb7a300` ("f2fs: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.4+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-01-24 09:53:02 -08:00
Qu Wenruo	1bbb97b8ce	btrfs: scrub: Require mandatory block group RO for dev-replace [BUG] For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071, looped runs can lead to random failure, where scrub finds csum error. The possibility is not high, around 1/20 to 1/100, but it's causing data corruption. The bug is observable after commit `b12de52896` ("btrfs: scrub: Don't check free space before marking a block group RO") [CAUSE] Dev-replace has two source of writes: - Write duplication All writes to source device will also be duplicated to target device. Content: Not yet persisted data/meta - Scrub copy Dev-replace reused scrub code to iterate through existing extents, and copy the verified data to target device. Content: Previously persisted data and metadata The difference in contents makes the following race possible: Regular Writer \| Dev-replace ----------------------------------------------------------------- ^ \| \| Preallocate one data extent \| \| at bytenr X, len 1M \| v \| ^ Commit transaction \| \| Now extent [X, X+1M) is in \| v commit root \| ================== Dev replace starts ========================= \| ^ \| \| Scrub extent [X, X+1M) \| \| Read [X, X+1M) \| \| (The content are mostly garbage \| \| since it's preallocated) ^ \| v \| Write back happens for \| \| extent [X, X+512K) \| \| New data writes to both \| \| source and target dev. \| v \| \| ^ \| \| Scrub writes back extent [X, X+1M) \| \| to target device. \| \| This will over write the new data in \| \| [X, X+512K) \| v This race can only happen for nocow writes. Thus metadata and data cow writes are safe, as COW will never overwrite extents of previous transaction (in commit root). This behavior can be confirmed by disabling all fallocate related calls in fsstress (), then all related tests can pass a 2000 run loop. 
: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \ -f collapse=0 -f punch=0 -f resvsp=0" I didn't expect the resvsp ioctl to fall back to fallocate in VFS... [FIX] Make dev-replace require mandatory block group RO, and wait for current nocow writes before calling scrub_chunk(). This patch will mostly revert commit `76a8efa171` ("btrfs: Continue replace when set_block_ro failed") for the dev-replace path. The side effect is that dev-replace can be stricter about available space, but that is definitely worth it to avoid data corruption. Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: `76a8efa171` ("btrfs: Continue replace when set_block_ro failed") Fixes: `b12de52896` ("btrfs: scrub: Don't check free space before marking a block group RO") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-24 14:35:56 +01:00
YueHaibing	b3531f5fc1	xfs: remove unused variable 'done' fs/xfs/xfs_inode.c: In function 'xfs_itruncate_extents_flags': fs/xfs/xfs_inode.c:1523:8: warning: unused variable 'done' [-Wunused-variable] commit `4bbb04abb4` ("xfs: truncate should remove all blocks, not just to the end of the page cache") left behind this, so remove it. Fixes: `4bbb04abb4` ("xfs: truncate should remove all blocks, not just to the end of the page cache") Reported-by: Hulk Robot <hulkci@huawei.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-01-23 21:24:50 -08:00
Darrick J. Wong	54027a4993	xfs: fix uninitialized variable in xfs_attr3_leaf_inactive Dan Carpenter pointed out that error is uninitialized. While there never should be an attr leaf block with zero entries, let's not leave that logic bomb there. Fixes: `0bb9d159bd` ("xfs: streamline xfs_attr3_leaf_inactive") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2020-01-23 16:11:32 -08:00
Linus Torvalds	fa0a4e3b54	A fix for a potential use-after-free from Jeff, marked for stable. Merge tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-client Pull ceph fix from Ilya Dryomov: "A fix for a potential use-after-free from Jeff, marked for stable" * tag 'ceph-for-5.5-rc8' of https://github.com/ceph/ceph-client: ceph: hold extra reference to r_parent over life of request	2020-01-23 11:21:35 -08:00
Dave Hansen	42222eae17	mm: remove arch_bprm_mm_init() hook From: Dave Hansen <dave.hansen@linux.intel.com> MPX is being removed from the kernel due to a lack of support in the toolchain going forward (gcc). arch_bprm_mm_init() is used at execve() time. The only non-stub implementation is on x86 for MPX. Remove the hook entirely from all architectures and generic code. Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: x86@kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>	2020-01-23 10:41:16 -08:00
Linus Torvalds	3c2659bd1d	readdir: make user_access_begin() use the real access range In commit `9f79b78ef7` ("Convert filldir[64]() from __put_user() to unsafe_put_user()") I changed filldir to not do individual __put_user() accesses, but instead use unsafe_put_user() surrounded by the proper user_access_begin/end() pair. That makes them enormously faster on modern x86, where the STAC/CLAC games make individual user accesses fairly heavy-weight. However, the user_access_begin() range was not really the exact right one, since filldir() has the unfortunate problem that it needs to not only fill out the new directory entry, it also needs to fix up the previous one to contain the proper file offset. It's unfortunate, but the "d_off" field in "struct dirent" is _not_ the file offset of the directory entry itself - it's the offset of the next one. So we end up backfilling the offset in the previous entry as we walk along. But since x86 didn't really care about the exact range, and used to be the only architecture that did anything fancy in user_access_begin() to begin with, the filldir[64]() changes did something lazy, and even commented on it: /* * Note! This range-checks 'previous' (which may be NULL). * The real range was checked in getdents */ if (!user_access_begin(dirent, sizeof(*dirent))) goto efault; and it all worked fine. But now 32-bit ppc is starting to also implement user_access_begin(), and the fact that we faked the range to only be the (possibly not even valid) previous directory entry becomes a problem, because ppc32 will actually be using the range that is passed in for more than just "check that it's user space". This is a complete rewrite of Christophe's original patch. By saving off the record length of the previous entry instead of a pointer to it in the filldir data structures, we can simplify the range check and the writing of the previous entry d_off field.
No need for any conditionals in the user accesses themselves, although we retain the conditional EINTR checking for the "was this the first directory entry" signal handling latency logic. Fixes: `9f79b78ef7` ("Convert filldir[64]() from __put_user() to unsafe_put_user()") Link: https://lore.kernel.org/lkml/a02d3426f93f7eb04960a4d9140902d278cab0bb.1579697910.git.christophe.leroy@c-s.fr/ Link: https://lore.kernel.org/lkml/408c90c4068b00ea8f1c41cca45b84ec23d4946b.1579783936.git.christophe.leroy@c-s.fr/ Reported-and-tested-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-23 10:15:28 -08:00
Linus Torvalds	2c6b7bcd74	readdir: be more conservative with directory entry names Commit `8a23eb804c` ("Make filldir[64]() verify the directory entry filename is valid") added some minimal validity checks on the directory entries passed to filldir[64](). But they really were pretty minimal. This fleshes out at least the name length check: we used to disallow zero-length names, but really, negative lengths or over-long names aren't ok either. Both could happen if there is some filesystem corruption going on. Now, most filesystems tend to use just an "unsigned char" or similar for the length of a directory entry name, so even with a corrupt filesystem you should never see anything odd like that. But since we then use the name length to create the directory entry record length, let's make sure it actually is half-way sensible. Note how POSIX states that the size of a path component is limited by NAME_MAX, but we actually use PATH_MAX for the check here. That's because while NAME_MAX is generally the correct maximum name length (it's 255, for the same old "name length is usually just a byte on disk"), there's nothing in the VFS layer that really cares. So the real limitation at a VFS layer is the total pathname length you can pass as a filename: PATH_MAX. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-23 10:05:05 -08:00
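The name length check described in the readdir commit above might look roughly like this in user space. `dirent_name_len_ok()` and `SKETCH_PATH_MAX` are hypothetical names for illustration; the value 4096 and the exact comparison against the upper bound are assumptions, since the commit message only says that zero, negative, and over-long lengths are refused and that PATH_MAX is the bound.

```c
#include <stdbool.h>

/* Assumption: 4096 stands in for the kernel's PATH_MAX in this sketch. */
#define SKETCH_PATH_MAX 4096

/*
 * Sketch only: accept a directory entry name length if it is positive
 * and shorter than the path limit. The length is a signed int so that
 * a negative value from a corrupt filesystem is caught as well.
 */
static bool dirent_name_len_ok(int len)
{
	if (len <= 0)
		return false;
	if (len >= SKETCH_PATH_MAX)
		return false;
	return true;
}
```

A length of 255 (the usual NAME_MAX) passes, while 0, -1, and 4096 are all rejected.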
Hridya Valsaraju	fc7100ea2a	f2fs: Add f2fs stats to sysfs Currently f2fs stats are only available from /d/f2fs/status. This patch adds some of the f2fs stats to sysfs so that they are accessible even when debugfs is not mounted. The following sysfs nodes are added: -/sys/fs/f2fs/<disk>/free_segments -/sys/fs/f2fs/<disk>/cp_foreground_calls -/sys/fs/f2fs/<disk>/cp_background_calls -/sys/fs/f2fs/<disk>/gc_foreground_calls -/sys/fs/f2fs/<disk>/gc_background_calls -/sys/fs/f2fs/<disk>/moved_blocks_foreground -/sys/fs/f2fs/<disk>/moved_blocks_background -/sys/fs/f2fs/<disk>/avg_vblocks Signed-off-by: Hridya Valsaraju <hridya@google.com> [Jaegeuk Kim: allow STAT_FS without DEBUG_FS] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-01-23 09:24:25 -08:00
Filipe Manana	831d2fa25a	Btrfs: make deduplication with range including the last block work Since btrfs was migrated to use the generic VFS helpers for clone and deduplication, it stopped allowing for the last block of a file to be deduplicated when the source file size is not sector size aligned (when eof is somewhere in the middle of the last block). There are two reasons for that: 1) The generic code always rounds down, to a multiple of the block size, the range's length for deduplications. This means we end up never deduplicating the last block when the eof is not block size aligned, even for the safe case where the destination range's end offset matches the destination file's size. That rounding down operation is done at generic_remap_check_len(); 2) Because of that, the btrfs specific code does not expect anymore any non-aligned range lengths for deduplication and therefore does not work if such a non-aligned length is given. This patch addresses that second part, and it depends on a patch that fixes generic_remap_check_len(), in the VFS, which was submitted earlier and has the following subject: "fs: allow deduplication of eof block into the end of the destination file" These two patches address reports from users that started seeing lower deduplication rates due to the last block never being deduplicated when the file size is not aligned to the filesystem's block size. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 18:24:07 +01:00
Filipe Manana	a5e6ea18e3	fs: allow deduplication of eof block into the end of the destination file We always round down, to a multiple of the filesystem's block size, the length to deduplicate at generic_remap_check_len(). However this is only needed if an attempt to deduplicate the last block into the middle of the destination file is requested, since that leads into a corruption if the length of the source file is not block size aligned. When an attempt to deduplicate the last block into the end of the destination file is requested, we should allow it because it is safe to do it - there's no stale data exposure and we are prepared to compare the data ranges for a length not aligned to the block (or page) size - in fact we even do the data compare before adjusting the deduplication length. After btrfs was updated to use the generic helpers from VFS (by commit `34a28e3d77` ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")) we started to have user reports of deduplication not reflinking the last block anymore, and whence users getting lower deduplication scores. The main use case is deduplication of entire files that have a size not aligned to the block size of the filesystem. We already allow cloning the last block to the end (and beyond) of the destination file, so allow for deduplication as well. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 18:20:48 +01:00
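The length rule described in the commit above can be modelled in a few lines. This is a minimal sketch, not the kernel code: the function name `remap_check_len` and its parameters are hypothetical, and it only captures the rule stated in the message, namely that an unaligned length is rounded down only when the range would end in the middle of the destination file.

```python
def remap_check_len(length, pos_out, dest_size, block_size):
    """Return the length that may be deduplicated (illustrative model only).

    block_size is assumed to be a power of two, as filesystem block sizes are.
    """
    blkmask = block_size - 1

    # An aligned length never needs adjusting.
    if length & blkmask == 0:
        return length

    # Unaligned range ending in the middle of the destination file:
    # round down, since sharing a partial block there would corrupt data.
    if pos_out + length < dest_size:
        return length & ~blkmask

    # Unaligned range ending at (or beyond) the destination's EOF: safe as-is.
    return length
```

With a 4096-byte block size, a 5000-byte range that ends exactly at the destination's EOF keeps its full length, while the same range written into the middle of a larger file is shortened to 4096 bytes.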
Dmitry Monakhov	4068664e3c	ext4: fix extent_status fragmentation for plain files Extents are cached in read_extent_tree_block(); as a result, extents are not cached for inodes with depth == 0 when we try to find the extent using ext4_find_extent(). The result of the lookup is cached in ext4_map_blocks() but is only a subset of the extent on disk. As a result, the contents of extents status cache can get very badly fragmented for certain workloads, such as a random 4k read workload. File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 8191: 40960.. 49151: 8192: last,eof $ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test $ perf script | grep ext4_es_insert_extent | head -n 10 fio 131 [000] 13.975421: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W fio 131 [000] 13.975939: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W fio 131 [000] 13.976467: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W fio 131 [000] 13.976937: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W fio 131 [000] 13.977440: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W fio 131 [000] 13.977931: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W fio 131 [000] 13.978376: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W fio 131 [000] 13.978957: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W fio 131 [000] 13.979474: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W Fix this by caching the extents for inodes with depth == 0 in ext4_find_extent().
[ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this newly added function is not in extents_cache.c, and to avoid potential visual confusion with ext4_es_cache_extent(). -TYT ] Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-01-23 12:02:15 -05:00
Josef Bacik	4e19443da1	btrfs: free block groups after free'ing fs trees Sometimes when running generic/475 we would trip the WARN_ON(cache->reserved) check when free'ing the block groups on umount. This is because sometimes we don't commit the transaction because of IO errors and thus do not cleanup the tree logs until at umount time. These blocks are still reserved until they are cleaned up, but they aren't cleaned up until _after_ we do the free block groups work. Fix this by moving the free after free'ing the fs roots, that way all of the tree logs are cleaned up and we have a properly cleaned fs. A bunch of loops of generic/475 confirmed this fixes the problem. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:39 +01:00
Nikolay Borisov	1362089d2a	btrfs: Fix split-brain handling when changing FSID to metadata uuid Current code doesn't correctly handle the situation which arises when a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID changed to the one in metadata uuid. This causes the incompat flag to disappear. In case of a power failure we could end up in a situation where part of the disks in a multi-disk filesystem are correctly reverted to METADATA_UUID_INCOMPAT flag unset state, while others have METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS. This patch corrects the behavior required to handle the case where a disk of the second type is scanned first, creating the necessary btrfs_fs_devices. Subsequently, when a disk which has already completed the transition is scanned it should overwrite the data in btrfs_fs_devices. Reported-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:39 +01:00
Nikolay Borisov	0584071014	btrfs: Handle another split brain scenario with metadata uuid feature There is one more case which isn't handled by the original metadata uuid work. Namely, when a filesystem has the METADATA_UUID incompat bit and the user decides to change the FSID to the original one, i.e. to have metadata_uuid and fsid match. In case of power failure while this operation is in progress we could end up in a situation where some of the disks have the incompat bit removed and the others have both METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags. This patch handles the case where a disk that has successfully changed its FSID such that it equals METADATA_UUID is scanned first. Subsequently when a disk with both METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned find_fsid_changed won't be able to find an appropriate btrfs_fs_devices. This is done by extending find_fsid_changed to correctly find btrfs_fs_devices whose metadata_uuid/fsid are the same and they match the metadata_uuid of the currently scanned device. Fixes: `cc5de4e702` ("btrfs: Handle final split-brain possibility during fsid change") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reported-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:38 +01:00
Su Yue	c6730a0e57	btrfs: Factor out metadata_uuid code from find_fsid. find_fsid became rather hairy with the introduction of metadata uuid changing feature. Alleviate this by factoring out the metadata uuid specific code in a dedicated function which deals with finding correct fsid for a device with changed uuid. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:38 +01:00
Su Yue	c0d81c7cb2	btrfs: Call find_fsid from find_fsid_inprogress Since find_fsid_inprogress should also handle the case in which an fs didn't change its FSID make it call find_fsid directly. This makes the code in device_list_add simpler by eliminating a conditional call of find_fsid. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:37 +01:00
Filipe Manana	b5e4ff9d46	Btrfs: fix infinite loop during fsync after rename operations Recently fsstress (from fstests) sporadically started to trigger an infinite loop during fsync operations. This turned out to be because support for the rename exchange and whiteout operations was added to fsstress in fstests. These operations, unlike any others in fsstress, cause file names to be reused, whence triggering this issue. However it's not necessary to use rename exchange and rename whiteout operations to trigger this issue; simple rename operations and file creations are enough. The issue boils down to this: when we are logging inodes that conflict (that had the name of any inode we need to log during the fsync operation), we keep logging them even if they were already logged before, and after that we check if there's any other inode that conflicts with them and then add it again to the list of inodes to log. Skipping already logged inodes fixes the issue. Consider the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir # inode 257 $ touch /mnt/testdir/zz # inode 258 $ ln /mnt/testdir/zz /mnt/testdir/zz_link $ touch /mnt/testdir/a # inode 259 $ sync # The following 3 renames achieve the same result as a rename exchange # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a). $ mv /mnt/testdir/a /mnt/testdir/a/tmp $ mv /mnt/testdir/zz_link /mnt/testdir/a $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link # The following rename and file creation give the same result as a # rename whiteout operation (<rename_whiteout> zz to a2). 
$ mv /mnt/testdir/zz /mnt/testdir/a2 $ touch /mnt/testdir/zz # inode 260 $ xfs_io -c fsync /mnt/testdir/zz --> results in the infinite loop The following steps happen: 1) When logging inode 260, we find that its reference named "zz" was used by inode 258 in the previous transaction (through the commit root), so inode 258 is added to the list of conflicting inodes that need to be logged; 2) After logging inode 258, we find that its reference named "a" was used by inode 259 in the previous transaction, and therefore we add inode 259 to the list of conflicting inodes to be logged; 3) After logging inode 259, we find that its reference named "zz_link" was used by inode 258 in the previous transaction - we add inode 258 to the list of conflicting inodes to log, again - we had already logged it before at step 3. After logging it again, we find again that inode 259 conflicts with it, and we add again 259 to the list, etc - we end up repeating all the previous steps. So fix this by skipping logging of conflicting inodes that were already logged. Fixes: `6b5fc433a7` ("Btrfs: fix fsync after succession of renames of different files") CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:37 +01:00
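The loop described in the commit above, and the effect of skipping already logged inodes, can be illustrated with a small worklist model. This is only a sketch under assumptions: `log_with_conflicts` and the `conflicts` mapping are hypothetical stand-ins for the btrfs tree-log code, and "logging" is reduced to appending the inode number to a list.

```python
from collections import deque


def log_with_conflicts(start, conflicts):
    """Log `start` and every inode that transitively conflicts with it.

    `conflicts` maps an inode number to the inodes that held one of its
    names in the previous transaction (hypothetical model of the lookup).
    """
    logged = []            # order in which inodes get logged
    seen = {start}         # inodes already logged or queued
    pending = deque([start])

    while pending:
        ino = pending.popleft()
        logged.append(ino)
        for other in conflicts.get(ino, []):
            # The fix: never queue an inode that was already handled.
            # Without this check, 258 and 259 would re-add each other forever.
            if other in seen:
                continue
            seen.add(other)
            pending.append(other)
    return logged
```

Fed with the conflicts from the example (260 conflicts with 258, 258 with 259, 259 with 258 again), the model terminates after logging each inode once.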
Josef Bacik	d62b23c949	btrfs: set trans->dirty in btrfs_commit_transaction If we abort a transaction we have the following sequence if (!trans->dirty && list_empty(&trans->new_bgs)) return; WRITE_ONCE(trans->transaction->aborted, err); The idea being if we didn't modify anything with our trans handle then we don't really need to abort the whole transaction, maybe the other trans handles are fine and we can carry on. However in the case of create_snapshot we add a pending_snapshot object to our transaction and then commit the transaction. We don't actually modify anything. sync() behaves the same way, attach to an existing transaction and commit it. This means that if we have an IO error in the right places we could abort the committing transaction with our trans->dirty being not set and thus not set transaction->aborted. This is a problem because in the create_snapshot() case we depend on pending->error being set to something, or btrfs_commit_transaction returning an error. If we are not the trans handle that gets to commit the transaction, and we're waiting on the commit to happen we get our return value from cur_trans->aborted. If this was not set to anything because sync() hit an error in the transaction commit before it could modify anything then cur_trans->aborted would be 0. Thus we'd return 0 from btrfs_commit_transaction() in create_snapshot. This is a problem because we then try to do things with pending_snapshot->snap, which will be NULL because we didn't create the snapshot, and then we'll get a NULL pointer dereference like the following "BUG: kernel NULL pointer dereference, address: 00000000000001f0" RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330 Call Trace: ? btrfs_mksubvol.isra.31+0x3f2/0x510 btrfs_mksubvol.isra.31+0x4bc/0x510 ? __sb_start_write+0xfa/0x200 ? mnt_want_write_file+0x24/0x50 btrfs_ioctl_snap_create_transid+0x16c/0x1a0 btrfs_ioctl_snap_create_v2+0x11e/0x1a0 btrfs_ioctl+0x1534/0x2c10 ? 
free_debug_processing+0x262/0x2a3 do_vfs_ioctl+0xa6/0x6b0 ? do_sys_open+0x188/0x220 ? syscall_trace_enter+0x1f8/0x330 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4a/0x1b0 In order to fix this we need to make sure anybody who calls commit_transaction has trans->dirty set so that they properly set the trans->transaction->aborted value properly so any waiters know bad things happened. This was found while I was running generic/475 with my modified fsstress, it reproduced within a few runs. I ran with this patch all night and didn't see the problem again. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:37 +01:00
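The interaction between `trans->dirty` and `transaction->aborted` described in the commit above can be modelled as follows. This is a toy sketch, not btrfs code: the classes and the `mark_dirty` switch are hypothetical and exist only to contrast the behaviour before and after the fix.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    aborted: int = 0          # 0 means "no error recorded"


@dataclass
class TransHandle:
    transaction: Transaction
    dirty: bool = False       # set once the handle modified something
    new_bgs: list = field(default_factory=list)


def abort_transaction(trans, err):
    # Mirrors the quoted check: a handle that changed nothing does not
    # mark the shared transaction as aborted.
    if not trans.dirty and not trans.new_bgs:
        return
    trans.transaction.aborted = err


def commit_transaction(trans, io_error=0, mark_dirty=True):
    # The fix, as modelled here: committing always marks the handle dirty,
    # so an error during commit is visible to every waiter.
    if mark_dirty:
        trans.dirty = True
    if io_error:
        abort_transaction(trans, io_error)
    return trans.transaction.aborted
```

With `mark_dirty=False` (the old behaviour in this model), a commit that hits an I/O error from a handle that modified nothing returns 0, which is the situation that led to the NULL pointer dereference in the report.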
Josef Bacik	889bfa3908	btrfs: drop log root for dropped roots If we fsync on a subvolume and create a log root for that volume, and then later delete that subvolume we'll never clean up its log root. Fix this by making switch_commit_roots free the log for any dropped roots we encounter. The extra churn is because we need a btrfs_trans_handle, not the btrfs_transaction. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:36 +01:00
Anand Jain	668e48af7a	btrfs: sysfs, add devid/dev_state kobject and device attributes New sysfs attributes that track the filesystem status of devices, stored in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's a directory for each device, with name corresponding to the numerical device id. in_fs_metadata - device is in the list of fs metadata missing - device is missing (no device node or block device) replace_target - device is target of replace writeable - writes from fs are allowed These attributes reflect the state of the device::dev_state and are created at mount time. Sample output: $ pwd /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/ $ ls in_fs_metadata missing replace_target writeable $ cat missing 0 The output from these attributes is 0 or 1. 0 indicates unset and 1 indicates set. These attributes are readonly. It is observed that the device delete thread and sysfs read thread will not race because the delete thread calls sysfs kobject_put() which in turn waits for existing sysfs read to complete. Note for device replace devid swap: During the replace the target device temporarily assumes devid 0 before assigning the devid of the source device. In btrfs_dev_replace_finishing() we remove source sysfs devid using the function btrfs_sysfs_remove_devices_attr(), so after that call kobject_rename() to update the devid in the sysfs. This adds and calls btrfs_sysfs_update_devid() helper function to update the device id. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:36 +01:00
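A userspace reader for the attributes listed in the commit above might look like the sketch below. It assumes only what the message states: one directory per devid under `devinfo`, each holding the four read-only files containing 0 or 1. The helper name `read_devinfo` is hypothetical, and the directory is passed in so the function can also be pointed at a test tree.

```python
from pathlib import Path

# Attribute names taken from the commit message.
ATTRS = ("in_fs_metadata", "missing", "replace_target", "writeable")


def read_devinfo(devinfo_dir):
    """Return {devid: {attribute: bool}} for every device directory found."""
    result = {}
    for dev_dir in sorted(Path(devinfo_dir).iterdir()):
        if not dev_dir.is_dir():
            continue
        state = {}
        for attr in ATTRS:
            path = dev_dir / attr
            if path.is_file():
                # Each attribute file holds "0" (unset) or "1" (set).
                state[attr] = path.read_text().strip() == "1"
        result[dev_dir.name] = state
    return result
```

On a mounted filesystem the argument would be something like `/sys/fs/btrfs/<FSID>/devinfo`, with the FSID taken from the running system.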
Nikolay Borisov	1776ad172e	btrfs: Refactor btrfs_rmap_block to improve readability Move variables to appropriate scope. Remove last BUG_ON in the function and rework error handling accordingly. Make the duplicate detection code more straightforward. Use in_range macro. And give variables more descriptive names by explicitly distinguishing between IO stripe size (size recorded in the chunk item) and data stripe size (the size of an actual stripe, constituting a logical chunk/block group). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:35 +01:00
Nikolay Borisov	bf2e2eb060	btrfs: Add self-tests for btrfs_rmap_block Add RAID1 and single testcases to verify that data stripes are excluded from super block locations and that the address mapping is valid. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:35 +01:00
Nikolay Borisov	b3ad2c17fd	btrfs: selftests: Add support for dummy devices Add basic infrastructure to create and link dummy btrfs_devices. This will be used in the pending btrfs_rmap_block test which deals with the block groups. Calling btrfs_alloc_dummy_device will link the newly created device to the passed fs_info and the test framework will free them once the test is finished. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:34 +01:00
Nikolay Borisov	96a14336bd	btrfs: Move and unexport btrfs_rmap_block It's used only during initial block group reading to map physical address of super block to a list of logical ones. Make it private to block-group.c, add proper kernel doc and ensure it's exported only for tests. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:34 +01:00
David Sterba	68c467cbb2	btrfs: separate definition of assertion failure handlers There's a report where objtool detects unreachable instructions, eg.: fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction This seems to be a false positive due to compiler version. The cause is in the ASSERT macro implementation that does the conditional check as IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef. To avoid that, use the ifdefs directly. There are still 2 reports that aren't fixed: fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David Sterba <dsterba@suse.com>	2020-01-23 17:24:23 +01:00
Daniel Rosenberg	edc440e3d2	fscrypt: improve format of no-key names When an encrypted directory is listed without the key, the filesystem must show "no-key names" that uniquely identify directory entries, are at most 255 (NAME_MAX) bytes long, and don't contain '/' or '\0'. Currently, for short names the no-key name is the base64 encoding of the ciphertext filename, while for long names it's the base64 encoding of the ciphertext filename's dirhash and second-to-last 16-byte block. This format has the following problems: - Since it doesn't always include the dirhash, it's incompatible with directories that will use a secret-keyed dirhash over the plaintext filenames. In this case, the dirhash won't be computable from the ciphertext name without the key, so it instead must be retrieved from the directory entry and always included in the no-key name. Casefolded encrypted directories will use this type of dirhash. - It's ambiguous: it's possible to craft two filenames that map to the same no-key name, since the method used to abbreviate long filenames doesn't use a proper cryptographic hash function. Solve both these problems by switching to a new no-key name format that is the base64 encoding of a variable-length structure that contains the dirhash, up to 149 bytes of the ciphertext filename, and (if any bytes remain) the SHA-256 of the remaining bytes of the ciphertext filename. This ensures that each no-key name contains everything needed to find the directory entry again, contains only legal characters, doesn't exceed NAME_MAX, is unambiguous unless there's a SHA-256 collision, and that we only take the performance hit of SHA-256 on very long filenames. Note: this change does not address the existing issue where users can modify the 'dirhash' part of a no-key name and the filesystem may still accept the name. 
Signed-off-by: Daniel Rosenberg <drosen@google.com> [EB: improved comments and commit message, fixed checking return value of base64_decode(), check for SHA-256 error, continue to set disk_name for short names to keep matching simpler, and many other cleanups] Link: https://lore.kernel.org/r/20200120223201.241390-7-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:50:03 -08:00
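The new no-key name layout described in the commit above can be sketched as follows. Several details here are assumptions rather than facts from the message: the dirhash is modelled as two 32-bit little-endian values, and Python's URL-safe base64 alphabet (with padding stripped) stands in for whatever '/'-free alphabet the kernel uses. Only the 149-byte prefix limit and the SHA-256 of the remaining bytes come from the commit text.

```python
import base64
import hashlib
import struct

MAX_CIPHERTEXT_PREFIX = 149  # from the commit message


def nokey_name(dirhash, minor_hash, ciphertext):
    """Build an illustrative no-key name for a ciphertext filename."""
    # Assumed layout: 8 bytes of hash, then up to 149 ciphertext bytes.
    blob = struct.pack("<II", dirhash, minor_hash)
    blob += ciphertext[:MAX_CIPHERTEXT_PREFIX]

    # Only long names pay for SHA-256: hash whatever did not fit.
    rest = ciphertext[MAX_CIPHERTEXT_PREFIX:]
    if rest:
        blob += hashlib.sha256(rest).digest()

    # URL-safe base64 never emits '/'; the padding is dropped.
    return base64.urlsafe_b64encode(blob).rstrip(b"=").decode("ascii")
```

Under these assumptions the largest structure is 8 + 149 + 32 = 189 bytes, which encodes to 252 characters and so stays within NAME_MAX (255).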
Eric Biggers	aec992aab8	ubifs: allow both hash and disk name to be provided in no-key names In order to support a new dirhash method that is a secret-keyed hash over the plaintext filenames (which will be used by encrypted+casefolded directories on ext4 and f2fs), fscrypt will be switching to a new no-key name format that always encodes the dirhash in the name. UBIFS isn't happy with this because it has assertions that verify that either the hash or the disk name is provided, not both. Change it to use the disk name if one is provided, even if a hash is available too; else use the hash. Link: https://lore.kernel.org/r/20200120223201.241390-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:49:56 -08:00
Eric Biggers	f0d07a98a0	ubifs: don't trigger assertion on invalid no-key filename If userspace provides an invalid fscrypt no-key filename which encodes a hash value with any of the UBIFS node type bits set (i.e. the high 3 bits), gracefully report ENOENT rather than triggering ubifs_assert(). Test case with kvm-xfstests shell: . fs/ubifs/config . ~/xfstests/common/encrypt dev=$(__blkdev_to_ubi_volume /dev/vdc) ubiupdatevol $dev -t mount $dev /mnt -t ubifs mkdir /mnt/edir xfs_io -c set_encpolicy /mnt/edir rm /mnt/edir/_,,,,,DAAAAAAAAAAAAAAAAAAAAAAAAAA With the bug, the following assertion fails on the 'rm' command: [ 19.066048] UBIFS error (ubi0:0 pid 379): ubifs_assert_failed: UBIFS assert failed: !(hash & ~UBIFS_S_KEY_HASH_MASK), in fs/ubifs/key.h:170 Fixes: `f4f61d2cc6` ("ubifs: Implement encrypted filenames") Cc: <stable@vger.kernel.org> # v4.10+ Link: https://lore.kernel.org/r/20200120223201.241390-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:49:56 -08:00
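The check described in the commit above amounts to rejecting any decoded hash that has bits set outside the key hash mask. The sketch below is illustrative only: the mask value is inferred from the message's "high 3 bits" of a 32-bit hash, and `check_nokey_hash` is a hypothetical name, not a UBIFS function.

```python
import errno

# Assumption: a 32-bit hash whose high 3 bits are reserved for the node type.
UBIFS_S_KEY_HASH_MASK = 0x1FFFFFFF


def check_nokey_hash(hash_value):
    """Return 0 for a usable hash, -ENOENT if reserved bits are set."""
    if hash_value & ~UBIFS_S_KEY_HASH_MASK & 0xFFFFFFFF:
        # Report "no such entry" instead of tripping an assertion.
        return -errno.ENOENT
    return 0
```

A name that decodes to a hash such as 0x20000000 is then treated as a lookup miss rather than a fatal condition.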
Eric Biggers	f592efe735	fscrypt: clarify what is meant by a per-file key Now that there's sometimes a second type of per-file key (the dirhash key), clarify some function names, macros, and documentation that specifically deal with per-file encryption keys. Link: https://lore.kernel.org/r/20200120223201.241390-4-ebiggers@kernel.org Reviewed-by: Daniel Rosenberg <drosen@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:49:56 -08:00
Daniel Rosenberg	aa408f835d	fscrypt: derive dirhash key for casefolded directories When we allow indexed directories to use both encryption and casefolding, for the dirhash we can't just hash the ciphertext filenames that are stored on-disk (as is done currently) because the dirhash must be case insensitive, but the stored names are case-preserving. Nor can we hash the plaintext names with an unkeyed hash (or a hash keyed with a value stored on-disk like ext4's s_hash_seed), since that would leak information about the names that encryption is meant to protect. Instead, if we can accept a dirhash that's only computable when the fscrypt key is available, we can hash the plaintext names with a keyed hash using a secret key derived from the directory's fscrypt master key. We'll use SipHash-2-4 for this purpose. Prepare for this by deriving a SipHash key for each casefolded encrypted directory. Make sure to handle deriving the key not only when setting up the directory's fscrypt_info, but also in the case where the casefold flag is enabled after the fscrypt_info was already set up. (We could just always derive the key regardless of casefolding, but that would introduce unnecessary overhead for people not using casefolding.) Signed-off-by: Daniel Rosenberg <drosen@google.com> [EB: improved commit message, updated fscrypt.rst, squashed with change that avoids unnecessarily deriving the key, and many other cleanups] Link: https://lore.kernel.org/r/20200120223201.241390-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:49:55 -08:00
Daniel Rosenberg	6e1918cfb2	fscrypt: don't allow v1 policies with casefolding Casefolded encrypted directories will use a new dirhash method that requires a secret key. If the directory uses a v2 encryption policy, it's easy to derive this key from the master key using HKDF. However, v1 encryption policies don't provide a way to derive additional keys. Therefore, don't allow casefolding on directories that use a v1 policy. Specifically, make it so that trying to enable casefolding on a directory that has a v1 policy fails, trying to set a v1 policy on a casefolded directory fails, and trying to open a casefolded directory that has a v1 policy (if one somehow exists on-disk) fails. Signed-off-by: Daniel Rosenberg <drosen@google.com> [EB: improved commit message, updated fscrypt.rst, and other cleanups] Link: https://lore.kernel.org/r/20200120223201.241390-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:47:15 -08:00
Eric Biggers	1b3b827ee5	fscrypt: add "fscrypt_" prefix to fname_encrypt() fname_encrypt() is a global function, due to being used in both fname.c and hooks.c. So it should be prefixed with "fscrypt_", like all the other global functions in fs/crypto/. Link: https://lore.kernel.org/r/20200120071736.45915-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:45:10 -08:00
Eric Biggers	13a10da946	fscrypt: don't print name of busy file when removing key When an encryption key can't be fully removed due to file(s) protected by it still being in-use, we shouldn't really print the path to one of these files to the kernel log, since parts of this path are likely to be encrypted on-disk, and (depending on how the system is set up) the confidentiality of this path might be lost by printing it to the log. This is a trade-off: a single file path often doesn't matter at all, especially if it's a directory; the kernel log might still be protected in some way; and I had originally hoped that any "inode(s) still busy" bugs (which are security weaknesses in their own right) would be quickly fixed and that to do so it would be super helpful to always know the file path and not have to run 'find dir -inum $inum' after the fact. But in practice, these bugs can be hard to fix (e.g. due to asynchronous process killing that is difficult to eliminate, for performance reasons), and also not tied to specific files, so knowing a file path doesn't necessarily help. So to be safe, for now let's just show the inode number, not the path. If someone really wants to know a path they can use 'find -inum'. Fixes: `b1c0ec3599` ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl") Cc: <stable@vger.kernel.org> # v5.4+ Link: https://lore.kernel.org/r/20200120060732.390362-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-01-22 14:45:08 -08:00
Pavel Begunkov	86a761f81e	io_uring: honor IOSQE_ASYNC for linked reqs REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 13:57:48 -07:00
Pavel Begunkov	1118591ab8	io_uring: prep req when do IOSQE_ASYNC Whenever IOSQE_ASYNC is set, requests will be punted to async without getting into io_issue_req() and without proper preparation done (e.g. io_req_defer_prep()). Hence they will be left uninitialised. Prepare them before punting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 13:57:46 -07:00
Linus Torvalds	dbab40bdb4	Merge tag 'io_uring-5.5-2020-01-22' of git://git.kernel.dk/linux-block Pull io_uring fix from Jens Axboe: "This was supposed to have gone in last week, but due to a brain fart on my part, I forgot that we made this struct addition in the 5.5 cycle. So here it is for 5.5, to prevent having a 32 vs 64-bit compatibility issue with the files_update command" * tag 'io_uring-5.5-2020-01-22' of git://git.kernel.dk/linux-block: io_uring: fix compat for IORING_REGISTER_FILES_UPDATE	2020-01-22 08:30:09 -08:00
Jeff Layton	9c1c2b35f1	ceph: hold extra reference to r_parent over life of request Currently, we just assume that it will stick around by virtue of the submitter's reference, but later patches will allow the syscall to return early and we can't rely on that reference at that point. While I'm not aware of any reports of it, Xiubo pointed out that this may fix a use-after-free. If the wait for a reply times out or is canceled via signal, and then the reply comes in after the syscall returns, the client can end up trying to access r_parent without a reference. Take an extra reference to the inode when setting r_parent and release it when releasing the request. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-01-21 19:02:37 +01:00
Alex Shi	154a4dcfc9	fs/reiserfs: remove unused macros These macros have never been used since they were introduced, so remove them. Link: https://lore.kernel.org/r/1579602338-57079-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Bharath Vedartham <linux.bhar@gmail.com> Cc: Hariprasad Kelam <hariprasad.kelam@gmail.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: zhengbin <zhengbin13@huawei.com> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: reiserfs-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2020-01-21 17:23:05 +01:00
Alex Shi	ed21c58eef	fs/quota: remove unused macro The __QUOTA_V2_PARANOIA macro is never used, so remove it. Link: https://lore.kernel.org/r/1579602334-57039-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jan Kara <jack@suse.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2020-01-21 17:22:00 +01:00
Alex Shi	c04f2e0dd5	gfs2: remove unused LBIT macros Since commit `223b2b889f` ("GFS2: Fix alignment issue and tidy gfs2_bitfit"), these 3 macros aren't used anymore, so remove them. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-01-21 11:19:45 +01:00
Alex Shi	b3ca4e447d	fs/gfs2: remove unused IS_DINODE and IS_LEAF macros Since commit `1579343a73` ("GFS2: Remove dirent_first() function"), these macros aren't used any more, so remove them. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-01-21 11:19:38 +01:00
Gao Xiang	1e4a295567	erofs: clean up z_erofs_submit_queue() A label and extra variables will be eliminated, which is cleaner. Link: https://lore.kernel.org/r/20200121064819.139469-1-gaoxiang25@huawei.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-21 16:46:23 +08:00
Gao Xiang	587a67b777	erofs: fold in postsubmit_is_all_bypassed() There is no need for such a separate helper, since the cache strategy compile-time configs were changed into runtime options in v5.4. No logic changes. Link: https://lore.kernel.org/r/20200121064747.138987-1-gaoxiang25@huawei.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>	2020-01-21 16:46:17 +08:00
Russell King	25e5d4df3b	fs/adfs: mostly divorse inode number from indirect disc address Avoid using the inode number as the indirect disc address, even though these currently have the same value. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	08ead1b8b9	fs/adfs: super: add support for E and E+ floppy image formats Add support for ADFS E and E+ floppy image formats, which, unlike their hard disk variants, do not have a filesystem boot block - they have a single map zone, with the map fragment stored at sector 0. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	e3858e125b	fs/adfs: super: extract filesystem block probe Separate the filesystem block probing from the superblock filling so we can support other ADFS filesystem formats, such as the single-zone E and E+ floppy image formats which do not have a boot block. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	ccbc80a89d	fs/adfs: dir: remove debug in adfs_dir_update() Remove the noisy debug in adfs_dir_update(). Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	f352064275	fs/adfs: super: fix inode dropping When we have write support enabled, we must not drop inodes before they have been written back, otherwise we lose updates to the filesystem on umount. Keep the inodes around unless we are built in read-only mode, or we are mounted read-only. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	a464152f2e	fs/adfs: bigdir: implement directory update support Implement big directory entry update support in the same way that we do for new directories. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	d79288b4f6	fs/adfs: bigdir: calculate and validate directory checkbyte When reading a big directory, calculate and validate the directory checkbyte to ensure that the directory contents are valid. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	aa3d4e0152	fs/adfs: bigdir: directory validation strengthening Strengthen the directory validation by ensuring that the header fields contain sensible values that fit inside the directory, and limit the directory size to 4MB as per RISC OS requirements. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	6674ecab90	fs/adfs: bigdir: extract directory validation Extract the directory validation from the directory reading function as we will want to re-use this code. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	0db35a02a1	fs/adfs: bigdir: factor out directory entry offset calculation Factor out the directory entry byte offset calculation. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	aacc954c1b	fs/adfs: newdir: split out directory commit from update After changing a directory, we need to update the sequence numbers and calculate the new check byte before the directory is scheduled to be written back to the media. Since this needs to happen for any change to the directory, move this into a separate method. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	cc625ccd0e	fs/adfs: newdir: clean up adfs_f_update() __adfs_dir_put() and adfs_dir_find_entry() are only called from adfs_f_update(), so move them into this function, removing some unnecessary entry copying by doing so. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:42 -05:00
Russell King	9318731bec	fs/adfs: newdir: merge adfs_dir_read() into adfs_f_read() adfs_dir_read() is only called from adfs_f_read(), so merge it into that function. As new directories are always 2048 bytes in size, (which we rely on elsewhere) we can consolidate some of the code. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	7a0e4048bf	fs/adfs: newdir: improve directory validation Check that the lastmask and reserved fields are all zero, as per the documentation. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	ffc8df347e	fs/adfs: newdir: factor out directory format validation We have two locations where we validate the new directory format, so factor this out to a helper. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	016936b321	fs/adfs: dir: use pointers to access directory head/tails Add and use pointers in the adfs_dir structure to access the directory head and tail structures, which will always be contiguous in a buffer. This allows us to avoid memcpy()ing the data in the new directory code, making it slightly more efficient. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	4287e4deb1	fs/adfs: dir: add more efficient iterate() per-format method Rather than using setpos + getnext to iterate through the directory entries, pass iterate() down to the dir format code to populate the dirents. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	cdc46e99e1	fs/adfs: dir: switch to iterate_shared method There is nothing in our readdir (aka iterate) method that relies on the directory inode being exclusively locked, so switch to using the iterate_shared() hook rather than iterate(). Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	4a0a88b666	fs/adfs: dir: improve compiler coverage in adfs_dir_update Get rid of the ifdef, using IS_ENABLED() instead to detect whether the code should be callable. This allows the compiler to always parse the following code, reducing the chances of errors being missed. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	f6075c7907	fs/adfs: dir: improve update failure handling When we update a directory, a number of errors may happen. If we failed to find the entry to update, we can just release the directory buffers as normal. However, if we have some other error, we may have partially updated the buffers, resulting in an invalid directory. In this case, we need to discard the buffers to avoid writing the contents back to the media, and later re-read the directory from the media. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	ae5df41390	fs/adfs: dir: modernise on-disk directory structures Use __u8 and pack the structures for on-disk directories. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	deed1bfd15	fs/adfs: dir: update directory locking Update directory locking such that it covers the validation of the directory, which could fail if another thread is concurrently writing to the same directory. Since we may sleep, we need to use a rwsem rather than a rw spinlock. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	c3c8149b35	fs/adfs: dir: add helper to mark directory buffers dirty Provide a helper for marking directory buffers dirty so they get written back to disk. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	90011c7ad9	fs/adfs: dir: add helper to read directory using inode Add a helper to read a directory using the inode, which we do in two places. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	419a6e5e82	fs/adfs: dir: add generic directory reading Both directory formats code the mechanics of fetching the directory buffers using their own implementations. Consolidate these into one implementation. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	a317120bf7	fs/adfs: dir: add generic copy functions Directories can span multiple buffers, and we currently open-code memcpy access to these buffers, including dealing with entries that are split across multiple buffers. Such code exists in both directory format implementations. Provide common functions to allow data to be copied from/to the directory buffers as if they were a contiguous set of buffers, and use them when accessing directories. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	acf5f0be8a	fs/adfs: dir: add common directory sync method adfs_fplus_sync() can be used for both directory formats since we now have a common way to access the buffer heads, so move it into dir.c and appropriately rename it. Remove the directory-format specific implementations. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	1dd9f5babf	fs/adfs: dir: add common directory buffer release method With the bhs pointer in place, we have no need for separate per-format free() methods, since a generic version will do. Provide a generic implementation, remove the format specific implementations and the method function pointer. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:41 -05:00
Russell King	95fbadbb55	fs/adfs: dir: add common dir object initialisation Initialise the dir object before we pass it down to the directory format specific read handler. This allows us to get rid of the initialisation inside those handlers. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	71b2612776	fs/adfs: dir: rename bh_fplus to bhs Rename bh_fplus to bhs in preparation to make some of the directory handling code sharable between implementations. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	f93793fd73	fs/adfs: map: fix map scanning When scanning the map for a fragment id, we need to keep track of the free space links, so we don't inadvertently believe that the freespace link is a valid fragment id. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	f6f14a0d71	fs/adfs: map: move map-specific sb initialisation to map.c Move map specific superblock initialisation to map.c, rather than having it spread into super.c. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	792314f8b2	fs/adfs: map: use find_next_bit_le() rather than open coding it Use find_next_bit_le() to find the end of a fragment in the map rather than open-coding this functionality. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	197ba3c519	fs/adfs: map: incorporate map offsets into layout lookup_zone() and scan_free_map() cope in different ways with the location of the map data within a zone: 1. lookup_zone() adds a four byte offset to the map data pointer to skip over the check and free link bytes. 2. scan_free_map() needs to use the free link pointer, which is an offset from itself, so we end up adding a 32-bit offset to the end pointer (aka mapsize) which is really confusing. Rename mapsize to endbit as this is really what it is, and incorporate the 32-bit offset into the map layout. This means that both dm_startbit and dm_endbit are now bit offsets from the start of the buffer, rather than four bytes in to the buffer. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	7b19526762	fs/adfs: map: factor out map cleanup We have several places which deal with releasing the map buffers and freeing the map array. Provide a helper for this. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	6092b6be30	fs/adfs: map: break up adfs_read_map() Split up adfs_read_map() into separate helpers to layout the map, read the map, and release the map buffers. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	e6160e469f	fs/adfs: map: rename adfs_map_free() to adfs_map_statfs() It is not obvious whether adfs_map_free() is freeing the map or returning the number of free blocks on the filesystem. Rename it to the more generic statfs() to make it clear that it's a statistic function. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	f75d398d6e	fs/adfs: map: move map reading and validation to map.c Keep all the map code together in map.c, rather than having some in super.c Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	81916245ce	fs/adfs: inode: fix adfs_mode2atts() Fix adfs_mode2atts() to actually update the file permissions on the media rather than using the current inode mode. Note also that directories do not have read/write permissions stored on the media. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Russell King	eeeb9dd98e	fs/adfs: inode: update timestamps to centisecond precision Despite ADFS timestamps having centi-second granularity, and Linux gaining fine-grained timestamp support in v2.5.48, fs/adfs was never updated. Update fs/adfs to centi-second support, and ensure that the inode ctime always reflects what is written in underlying media. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-01-20 20:12:40 -05:00
Pavel Begunkov	0463b6c58e	io_uring: use labeled array init in io_op_defs Don't rely on implicit ordering of IORING_OP_ and explicitly place them at the right place in io_op_defs. The former comments are now part of the code and can never go out of date. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	6b47ee6eca	io_uring: optimise sqe-to-req flags translation For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG) req->flags |= REQ_F_FLAG Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	87987898a1	io_uring: remove REQ_F_IO_DRAINED A request can get into the defer list only once, there is no need for marking it as drained, so remove it. This probably was left after extracting __need_defer() for use in timeouts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Jens Axboe	e46a7950d3	io_uring: file switch work needs to get flushed on exit We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure to flush after our teardown as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Pavel Begunkov	b14cca0c84	io_uring: hide uring_fd in ctx req->ring_fd and req->ring_file are used only during the prep stage during submission, which is protected by a mutex. There is no need to store them per-request, place them in ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Pavel Begunkov	0791015837	io_uring: remove extra check in __io_commit_cqring __io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Pavel Begunkov	711be0312d	io_uring: optimise use of ctx->drain_next Move setting ctx->drain_next to the only place it could be set, when it got linked non-head requests. The same for checking it, it's interesting only for a head of a link or a non-linked request. No functional changes here. This removes some code from the common path and also removes REQ_F_DRAIN_LINK flag, as it doesn't need it anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	66f4af93da	io_uring: add support for probing opcodes The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what is supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	10fef4bebf	io_uring: account fixed file references correctly in batch We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all. Also ensure we free requests properly between inflight accounted and normal requests. Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov <9erthalion6@gmail.com> Reported-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	354420f705	io_uring: add opcode to issue trace event For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	cebdb98617	io_uring: add support for IORING_OP_OPENAT2 Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f8748881b1	io_uring: remove 'fname' from io_open structure We only use it internally in the prep functions for both statx and openat, so we don't need it to be persistent across the request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	c12cedf24e	io_uring: add 'struct open_how' to the openat request context We'll need this for openat2(2) support, remove flags and mode from the existing io_open struct. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f2842ab5b7	io_uring: enable option to only trigger eventfd for async completions If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows that there are events available, and doesn't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <markuspapadakis@icloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	69b3e54613	io_uring: change io_ring_ctx bool fields into bit fields In preparation for adding another one, which would make us spill into another long (and hence bump the size of the ctx), change them to bit fields. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	c150368b49	io_uring: file set registration should use interruptible waits If an application attempts to register a set with unbounded requests pending, we can be stuck here forever if they don't complete. We can make this wait interruptible, and just abort if we get signaled. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
YueHaibing	96fd84d83a	io_uring: Remove unnecessary null check Null check kfree is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	fddafacee2	io_uring: add support for send(2) and recv(2) This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	2550878f84	io_uring: remove extra io_wq_current_is_worker() io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove extra check for io_wq_current_is_worker() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	caf582c652	io_uring: optimise commit_sqring() for common case It should be pretty rare to not submit anything when there is something in the ring. No need to keep heuristics for this case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Pavel Begunkov	ee7d46d9db	io_uring: optimise head checks in io_get_sqring() A user may ask to submit more than there is in the ring, and then io_uring will submit as much as it can. However, in the last iteration it will allocate an io_kiocb and immediately free it. It could do better and adjust @to_submit to what is in the ring. And since the ring's head is already checked here, there is no need to do it in the loop, spamming with smp_load_acquire()'s barriers Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Pavel Begunkov	9ef4f12489	io_uring: clamp to_submit in io_submit_sqes() Make io_submit_sqes() clamp @to_submit itself. It removes duplicated code and prepares for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	8110c1a621	io_uring: add support for IORING_SETUP_CLAMP Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c6ca97b30c	io_uring: extend batch freeing to cover more cases Currently we only batch free if fixed files are used, no links, no aux data, etc. This extends the batch freeing to only exclude the linked case and fallback case, and make io_free_req_many() handle the other cases just fine. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	8237e04598	io_uring: wrap multi-req freeing in struct req_batch This cleans up the code a bit, and it allows us to build on top of the multi-req freeing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Pavel Begunkov	2b85edfc0c	io_uring: batch getting pcpu references percpu_ref_tryget() has its own overhead. Instead of getting a reference for each request, grab a bunch once per io_submit_sqes(). ~5% throughput boost for a "submit and wait 128 nops" benchmark. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> __io_req_free_empty() -> __io_req_do_free() Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c1ca757bd6	io_uring: add IORING_OP_MADVISE This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	4840e418c2	io_uring: add IORING_OP_FADVISE This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:01 -07:00
Jens Axboe	ba04291eb6	io_uring: allow use of offset == -1 to mean file position This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	3a6820f2bb	io_uring: add non-vectored read/write commands For use cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particularly true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	e94f141bd2	io_uring: improve poll completion performance For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggy back on to the poll_llist instead of having to queue our own offload. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ad3eb2c89f	io_uring: split overflow state into SQ and CQ side We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side. This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	d3656344fe	io_uring: add lookup table for various opcode needs We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added. This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	add7b6b85a	io_uring: remove two unnecessary function declarations __io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forwards. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Pavel Begunkov	32fe525b6d	io_uring: move *queue_link_head() from common path Move io_queue_link_head() to links handling code in io_submit_sqe(), so it wouldn't need extra checks and would have better data locality. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Pavel Begunkov	9d76377f7e	io_uring: rename prev to head Calling "prev" a head of a link is a bit misleading. Rename it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ce35a47a3a	io_uring: add IOSQE_ASYNC io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	895e2ca0f6	io-wq: support concurrent non-blocking work io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast. Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	eddc7ef52a	io_uring: add support for IORING_OP_STATX This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:54 -07:00
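
The flag translation described in the "io_uring: optimise sqe-to-req flags translation" entry above can be illustrated with a minimal C sketch. This is not the kernel's code: every name below (`SQE_FLAG_*`, `REQ_FLAG_*`, `SQE_COPY_MASK`, `translate_branchy`, `translate_masked`) is a hypothetical stand-in, and the bit positions are assumptions chosen for the example. It only shows the general idea of giving both flag sets the same numeric values so that a per-flag branch can become one masked copy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical submission-side flags (stand-ins for IOSQE_*). */
#define SQE_FLAG_FIXED_FILE (1u << 0)
#define SQE_FLAG_DRAIN      (1u << 1)
#define SQE_FLAG_LINK       (1u << 2)

/* Hypothetical request-side flags: the low bits reuse the SQE values. */
#define REQ_FLAG_FIXED_FILE SQE_FLAG_FIXED_FILE
#define REQ_FLAG_DRAIN      SQE_FLAG_DRAIN
#define REQ_FLAG_LINK       SQE_FLAG_LINK
#define REQ_FLAG_INTERNAL   (1u << 8) /* request-only bit, never copied from an SQE */

/* Only these bits may be copied from the submission entry. */
#define SQE_COPY_MASK (SQE_FLAG_FIXED_FILE | SQE_FLAG_DRAIN | SQE_FLAG_LINK)

/* Compile-time check that user-settable bits cannot reach internal bits. */
static_assert((SQE_COPY_MASK & REQ_FLAG_INTERNAL) == 0,
              "copied bits must not overlap request-only bits");

/* Before: one test-and-set branch per flag. */
static uint32_t translate_branchy(uint32_t sqe_flags)
{
    uint32_t req_flags = 0;

    if (sqe_flags & SQE_FLAG_FIXED_FILE)
        req_flags |= REQ_FLAG_FIXED_FILE;
    if (sqe_flags & SQE_FLAG_DRAIN)
        req_flags |= REQ_FLAG_DRAIN;
    if (sqe_flags & SQE_FLAG_LINK)
        req_flags |= REQ_FLAG_LINK;
    return req_flags;
}

/* After: the values match bit for bit, so a masked copy is enough. */
static uint32_t translate_masked(uint32_t sqe_flags)
{
    return sqe_flags & SQE_COPY_MASK;
}
```

Both functions return the same result for any input. The mask matters because it keeps unknown or request-only bits supplied by the caller out of the request flags.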

62461 Commits