Add support for direct I/O on encrypted files when blk-crypto (inline
encryption) is being used for file contents encryption.
There will be a merge conflict with the block pull request in
fs/iomap/direct-io.c, due to some bio interface cleanups. The merge
resolution is straightforward and can be found in linux-next.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYjgCfhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6vjAQDp8D8OKIyj67KjYwvyHpNy0fZhxeur
RexC0nDfd9BE/AD/fV6zpCglmsuGxqxL0jmqeczKXn2y7nRFmPciCBTi/wY=
=kwNd
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"Add support for direct I/O on encrypted files when blk-crypto (inline
encryption) is being used for file contents encryption"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: update documentation for direct I/O support
f2fs: support direct I/O with fscrypt using blk-crypto
ext4: support direct I/O with fscrypt using blk-crypto
iomap: support direct I/O with fscrypt using blk-crypto
fscrypt: add functions for direct I/O support
Add validation check for JFS_IP(ipimap)->i_imap to prevent a NULL deref
in diFree since diFree uses it without do any validations.
When function jfs_mount calls diMount to initialize fileset inode
allocation map, it can fail and JFS_IP(ipimap)->i_imap won't be
initialized. Then it calls diFreeSpecial to close fileset inode allocation
map inode and it will flow into jfs_evict_inode. Function jfs_evict_inode
just validates JFS_SBI(inode->i_sb)->ipimap, then calls diFree. diFree use
JFS_IP(ipimap)->i_imap directly, then it will cause a NULL deref.
Reported-by: TCS Robot <tcs_robot@tencent.com>
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Syzbot reported divide error in dbNextAG(). The problem was in missing
validation check for malicious image.
Syzbot crafted an image with bmp->db_numag equal to 0. There wasn't any
validation checks, but dbNextAG() blindly use bmp->db_numag in divide
expression
Fix it by validating bmp->db_numag in dbMount() and return an error if
image is malicious
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+46f5c25af73eb8330eb6@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
In the very rare case where the readdir reply contains multiple cookies
that map to the same hash value, we can end up deadlocking waiting for a
page lock that we already hold. In this case we should fail the page
lock by using grab_cache_page_nowait().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When I/O fails in one of the currently connected DFS targets, retry it
from other targets as specified in MS-DFSC "3.1.5.2 I/O Operation to
+Target Fails with an Error Other Than STATUS_PATH_NOT_COVERED."
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
- Handle unusual AT_PHDR offsets (Akira Kawata)
- Fix initial mapping size when PT_LOADs are not ordered (Alexey Dobriyan)
- Move more code under CONFIG_COREDUMP (Alexey Dobriyan)
- Fix missing mmap_lock in file_files_note (Eric W. Biederman)
- Remove a.out support for alpha and m68k (Eric W. Biederman)
- Include first pages of non-exec ELF libraries in coredump (Jann Horn)
- Don't write past end of notes for regset gap in coredump (Rick Edgecombe)
- Comment clean-ups (Tom Rix)
- Force single empty string when argv is empty (Kees Cook)
- Add NULL argv selftest (Kees Cook)
- Properly redefine PT_GNU_* in terms of PT_LOOS (Kees Cook)
- MAINTAINERS: Update execve entry with tree (Kees Cook)
- Introduce initial KUnit testing for binfmt_elf (Kees Cook)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmI4ji4WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJi7VD/9o+PndYkeGclL7sYfouhSzK21W
go4SGCrTl0oK/mfz3qXVYeS4VFjNTCTEs8rSZdjHN8a9VAVSJ38z6FPwbSQobzEP
zXPuvwxe4GM4jb8FsBTcTEl1Wfw6kUV9JHXqFje6MuiZMXa8YDD+UMl95CgmGi1L
5sOw4quHXkG8nlC0v1PI9XSpmzK2nHmXBWVddnPXTUmEfitvoIJdf0iTJ4/4mYM/
OwrCiufGHvGtQFUrYTxgiZ3nvFdAkZDt+P8GA8NJOBCMDTPvsk57uTok1sW6CRFT
lSymgoc3SczBtHYO6nFl5U04XGsNY+iHYhjhNL10IoucdCvS2VS0vEb8ZXKg6wtQ
/tbgf1Mcfu7eoClA0ZjQX/pQbkPYL/s++Lwkc7pzknbmdwq+1yZF1+4Y1XItR4jJ
kUhVsewQuU0os7BnaREkFOcwqXfA4hixb9w79p+SjMX8/XrnSkLJ3cFswkGTUxdO
DOwhVcmqsZdVXMMk0R3oOtm9ABSp/FqvT8At2kZI0W93jhZGHWzOrU+psnkTUcDt
KpFEJzdoh4ImZvBK8F5f07dAlqeVEZvVDhBt+x1Wxcu90p7rmZJT8OV2mJCDVhZG
E2PW7UuLOAbgRM+E+gxz7SkpIMtOSFlxT2xGuygcRbIxOOeVnj1x9NwGdI9xcgpF
s021x7TcHbpvYakRsg==
=SyEY
-----END PGP SIGNATURE-----
Merge tag 'execve-v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
"Execve and binfmt updates.
Eric and I have stepped up to be the active maintainers of this area,
so here's our first collection. The bulk of the work was in coredump
handling fixes; additional details are noted below:
- Handle unusual AT_PHDR offsets (Akira Kawata)
- Fix initial mapping size when PT_LOADs are not ordered (Alexey
Dobriyan)
- Move more code under CONFIG_COREDUMP (Alexey Dobriyan)
- Fix missing mmap_lock in file_files_note (Eric W. Biederman)
- Remove a.out support for alpha and m68k (Eric W. Biederman)
- Include first pages of non-exec ELF libraries in coredump (Jann
Horn)
- Don't write past end of notes for regset gap in coredump (Rick
Edgecombe)
- Comment clean-ups (Tom Rix)
- Force single empty string when argv is empty (Kees Cook)
- Add NULL argv selftest (Kees Cook)
- Properly redefine PT_GNU_* in terms of PT_LOOS (Kees Cook)
- MAINTAINERS: Update execve entry with tree (Kees Cook)
- Introduce initial KUnit testing for binfmt_elf (Kees Cook)"
* tag 'execve-v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf: Don't write past end of notes for regset gap
a.out: Stop building a.out/osf1 support on alpha and m68k
coredump: Don't compile flat_core_dump when coredumps are disabled
coredump: Use the vma snapshot in fill_files_note
coredump/elf: Pass coredump_params into fill_note_info
coredump: Remove the WARN_ON in dump_vma_snapshot
coredump: Snapshot the vmas in do_coredump
coredump: Move definition of struct coredump_params into coredump.h
binfmt_elf: Introduce KUnit test
ELF: Properly redefine PT_GNU_* in terms of PT_LOOS
MAINTAINERS: Update execve entry with more details
exec: cleanup comments
fs/binfmt_elf: Refactor load_elf_binary function
fs/binfmt_elf: Fix AT_PHDR for unusual ELF files
binfmt: move more stuff undef CONFIG_COREDUMP
selftests/exec: Test for empty string on NULL argv
exec: Force single empty string when argv is empty
coredump: Also dump first pages of non-executable ELF libraries
ELF: fix overflow in total mapping size calculation
When the ring is exiting, as part of the shutdown, poll requests are
removed. But io_poll_remove_all() does not remove entries when finding
them, and since completions are done out-of-band, we can find and remove
the same entry multiple times.
We do guard the poll execution by poll ownership, but that does not
exclude us from reissuing a new one once the previous removal ownership
goes away.
This can race with poll execution as well, where we then end up seeing
req->apoll be NULL because a previous task_work requeue finished the
request.
Remove the poll entry when we find it and get ownership of it. This
prevents multiple invocations from finding it.
Fixes: aa43477b04 ("io_uring: poll rework")
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0/7IQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm+FD/9wmpVQE30Y4Atgw4VygOXwlS6mhNHIVGUV
Khd34mwVSh21hjBnyenWJX7KlcVT5YsfEmEz/CNvhxG7QwqjjFuu085n5eImJkHZ
yP9N2jLzR5mECs9nZ9FGZO7h6hl8gEX5agR/QndoVBMyA3ho4SsXyHUoDoobgQ+R
SuwAB0gvxA72wMGgIPtJ/y3EJxQyja+6kpOQiO+avQgccGXloi44YRLhqUJ5RkCE
iJxeiTB4xJt46f2RnU6cS8yy4WqsTY4PLCnM2mpgkQKDfesqA+4wZh1K7TcNKVN1
MntawBwHEtp0JqZxQQqr6PFqqOLoIs22+JJagFxzk00jHy+s10dTL1O8Yj/If9Y2
l7ivguJog9yXPTdJ7THR2x1RTTxYN7p4DfU6L6cw7VBmG51mSpKwePlK15dqUIC9
mpYUgqrR4v1LYXLSpDHqPFRtARKvyzNmmk61qmA3SyWLOXbOuyMYWHYW5ta62S9B
MnqaFjrKHrRzI2co8/Kqyv9t5ffb7eUvu9OUAmM5GnEpl/yg8iEUN+Rk55f0AnNa
1ABRUpiB6AOxRZZ/pqUBCB7wtkoyEk/O/Jqax0FrZWTYtdTQebMC8fbfPXCIdc3A
HH7e7Kc8mr4QLhTS9oJ/pvNrm4kdkunTZe8CGQmocGiJjU5sWrMFRNbk2N7/fuAi
kAYOByEjDA==
=PzC6
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/alloc-cleanups-2022-03-18' of git://git.kernel.dk/linux-block
Pull bio_alloc() cleanups from Jens Axboe:
"Filesystem cleanups to pass the bio op to bio_alloc() instead of
setting it just before bio submission".
* tag 'for-5.18/alloc-cleanups-2022-03-18' of git://git.kernel.dk/linux-block:
f2fs: pass the bio operation to bio_alloc_bioset
f2fs: don't pass a bio to f2fs_target_device
nilfs2: pass the operation to bio_alloc
ext4: pass the operation to bio_alloc
mpage: pass the operation to bio_alloc
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI098oQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppi8D/9OWymFBxOIpnr+/+SToKBNBfOdnYX5BuD/
n3LOxvGR35GLouO1izT+jdNBnK1P2C7RjohLRbZlrmr3Y/ZX2Cfpt1ziyggogVkY
+XuiP9h9MKFcCPNlXQIZFUuB74nLQW4mY6L+rbqho5ghyYnHthFTyYukggoouCqF
GwMqFzl+9wUkagHrNt+WCGky2fiKTxRB8yajczSM8xXuc7yBZ0+njMvXJ8dJNA3x
x1R9y6A67v6M5wEeROt/HCDH9p3BoM6ZPAI8ij5zlWVnDpKcjUgDMW7YNbZeYnKk
m3RrDYYlh95I6zKgggOEENknn4Fl67A1MJx84QzhbeAujc8/JSmJ6OROaiQBzQGM
h4Yd7+Fp3DTwGlhxVpXTkMX3EDE0NgUy7QVJknrxCkz/MjMb2n+2tV8GJlTDadVV
sSWFnGDdZRf9Zebomg0JLQQZwH5ZAi2EiOdmNpHYKUMfKYE4C5FDEQceUiqP8Ocy
Z5Eia8I745F7tm7gBC/Zkjjc/NZWT56qrlmgp3Cs32ReHG1suhq+tw+taCmFD32m
8I+QDgxP1vKkxH6o2W3Zq8BH7smauvjjGRAHYNZMyr58vQQ7YSRHkTuVMY9AJhCt
UftsMRYSR/tNtRIyaRRUU+YBTVNEhim9dtJ7h/J2COdHBCGNHFH+5kUlPw4tiknr
Hs2AdQhvJA==
=zzak
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/io_uring-statx-2022-03-18' of git://git.kernel.dk/linux-block
Pull io_uring statx fixes from Jens Axboe:
"On top of the main io_uring branch, this is to ensure that the
filename component of statx is stable after submit.
That requires a few VFS related changes"
* tag 'for-5.18/io_uring-statx-2022-03-18' of git://git.kernel.dk/linux-block:
io-uring: Make statx API stable
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI09cgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjI1D/4uo1FnG+mF9ZEPAy/FKBh6I74gHj7eBJyw
obKyH2oszN19VI+HLYYQgBrPwlc4lOzxmYM0kbAQWfB+PCgxAExX+OkJrGrxWW3R
Hlc1RSQkravh+CBzEc1VIYIWK3XF+guZIwqLPa3gS0hwea8LYBae9h3gKjhvb2kH
4BbSh7Ax5k9klVUcNJuUXnG7nytyaiaQAbMl88iGV3S2bpseWNi0WTHvJ3Eizbyz
sLponJcgUb0/H2G00MXf3M3+V40s7AiLXeEMwFcv29PZUQIEyp5P8Rx43//48nbE
0Rms88Uz4tFHRyFe/48IKa4tjJgk6fy1Cjd3+Thyn8TeJaYhD6vZu4LA/uGo8fkm
EZEmLFHw5mqn7Xz3OA31ErnnpSmwT/T164vMPcvrlLao/1ZSHD1fK2PjyDMTX+aO
1wec5+mMsTbOFgzYJRH3tyEH/GfWxpsNJIPfdCjV5BRPd5HwFBBTvRwFEVTX6+Go
CS4JAGWkTgmr2u1OeyzaQ1lYPZWD1GTt/HJj7rEvuqPzlreX9nRlvTtIcVubw39a
dgx7nWLLCOCN3o4tFHEvmf05/92TiQd1Vw8wEQT0NdmkLuqUuJJOnVo53YWzHA5Y
qqMwPcciiYMi92btrJdsZOgj3GCKxP6rLxyHJm1mY5kFSjEB5e0swV7xryme40r4
ZJ6sUvEyHg==
=OBma
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/io_uring-2022-03-18' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Fixes for current file position. Still doesn't have the f_pos_lock
sorted, but it's a step in the right direction (Dylan)
- Tracing updates (Dylan, Stefan)
- Improvements to io-wq locking (Hao)
- Improvements for provided buffers (me, Pavel)
- Support for registered file descriptors (me, Xiaoguang)
- Support for ring messages (me)
- Poll improvements (me)
- Fix for fixed buffers and non-iterator reads/writes (me)
- Support for NAPI on sockets (Olivier)
- Ring quiesce improvements (Usama)
- Misc fixes (Olivier, Pavel)
* tag 'for-5.18/io_uring-2022-03-18' of git://git.kernel.dk/linux-block: (42 commits)
io_uring: terminate manual loop iterator loop correctly for non-vecs
io_uring: don't check unrelated req->open.how in accept request
io_uring: manage provided buffers strictly ordered
io_uring: fold evfd signalling under a slower path
io_uring: thin down io_commit_cqring()
io_uring: shuffle io_eventfd_signal() bits around
io_uring: remove extra barrier for non-sqpoll iopoll
io_uring: fix provided buffer return on failure for kiocb_done()
io_uring: extend provided buf return to fails
io_uring: refactor timeout cancellation cqe posting
io_uring: normilise naming for fill_cqe*
io_uring: cache poll/double-poll state with a request flag
io_uring: cache req->apoll->events in req->cflags
io_uring: move req->poll_refs into previous struct hole
io_uring: make tracing format consistent
io_uring: recycle apoll_poll entries
io_uring: remove duplicated member check for io_msg_ring_prep()
io_uring: allow submissions to continue on error
io_uring: recycle provided buffers if request goes async
io_uring: ensure reads re-import for selected buffers
...
Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full. This has lead to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
- Support for including MTE tags in ELF coredumps
- Instruction encoder updates, including fixes to 64-bit immediate
generation and support for the LSE atomic instructions
- Improvements to kselftests for MTE and fpsimd
- Symbol aliasing and linker script cleanups
- Reduce instruction cache maintenance performed for user mappings
created using contiguous PTEs
- Support for the new "asymmetric" MTE mode, where stores are checked
asynchronously but loads are checked synchronously
- Support for the latest pointer authentication algorithm ("QARMA3")
- Support for the DDR PMU present in the Marvell CN10K platform
- Support for the CPU PMU present in the Apple M1 platform
- Use the RNDR instruction for arch_get_random_{int,long}()
- Update our copy of the Arm optimised string routines for str{n}cmp()
- Fix signal frame generation for CPUs which have foolishly elected to
avoid building in support for the fpsimd instructions
- Workaround for Marvell GICv3 erratum #38545
- Clarification to our Documentation (booting reqs. and MTE prctl())
- Miscellanous cleanups and minor fixes
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmIvta8QHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNAIhB/oDSva5FryAFExVuIB+mqRkbZO9kj6fy/5J
ctN9LEVO2GI/U1TVAUWop1lXmP8Kbq5UCZOAuY8sz7dAZs7NRUWkwTrXVhaTpi6L
oxCfu5Afu76d/TGgivNz+G7/ewIJRFj5zCPmHezLF9iiWPUkcAsP0XCp4a0iOjU4
04O4d7TL/ap9ujEes+U0oEXHnyDTPrVB2OVE316FKD1fgztcjVJ2U+TxX5O4xitT
PPIfeQCjQBq1B2OC1cptE3wpP+YEr9OZJbx+Ieweidy1CSInEy0nZ13tLoUnGPGU
KPhsvO9daUCbhbd5IDRBuXmTi/sHU4NIB8LNEVzT1mUPnU8pCizv
=ziGg
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
- Support for including MTE tags in ELF coredumps
- Instruction encoder updates, including fixes to 64-bit immediate
generation and support for the LSE atomic instructions
- Improvements to kselftests for MTE and fpsimd
- Symbol aliasing and linker script cleanups
- Reduce instruction cache maintenance performed for user mappings
created using contiguous PTEs
- Support for the new "asymmetric" MTE mode, where stores are checked
asynchronously but loads are checked synchronously
- Support for the latest pointer authentication algorithm ("QARMA3")
- Support for the DDR PMU present in the Marvell CN10K platform
- Support for the CPU PMU present in the Apple M1 platform
- Use the RNDR instruction for arch_get_random_{int,long}()
- Update our copy of the Arm optimised string routines for str{n}cmp()
- Fix signal frame generation for CPUs which have foolishly elected to
avoid building in support for the fpsimd instructions
- Workaround for Marvell GICv3 erratum #38545
- Clarification to our Documentation (booting reqs. and MTE prctl())
- Miscellanous cleanups and minor fixes
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
docs: sysfs-devices-system-cpu: document "asymm" value for mte_tcf_preferred
arm64/mte: Remove asymmetric mode from the prctl() interface
arm64: Add cavium_erratum_23154_cpus missing sentinel
perf/marvell: Fix !CONFIG_OF build for CN10K DDR PMU driver
arm64: mm: Drop 'const' from conditional arm64_dma_phys_limit definition
Documentation: vmcoreinfo: Fix htmldocs warning
kasan: fix a missing header include of static_keys.h
drivers/perf: Add Apple icestorm/firestorm CPU PMU driver
drivers/perf: arm_pmu: Handle 47 bit counters
arm64: perf: Consistently make all event numbers as 16-bits
arm64: perf: Expose some Armv9 common events under sysfs
perf/marvell: cn10k DDR perf event core ownership
perf/marvell: cn10k DDR perfmon event overflow handling
perf/marvell: CN10k DDR performance monitor support
dt-bindings: perf: marvell: cn10k ddr performance monitor
arm64: clean up tools Makefile
perf/arm-cmn: Update watchpoint format
perf/arm-cmn: Hide XP PUB events for CMN-600
arm64: drop unused includes of <linux/personality.h>
arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones
...
These functions are page cache functionality and don't need to be
declared in fs.h.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Add kernel-doc and return the number of pages removed in order to
get the statistics right in __invalidate_mapping_pages().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
This saves a lot of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
As bughunter reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215709
f2fs may hang when mounting a fuzzed image, the dmesg shows as below:
__filemap_get_folio+0x3a9/0x590
pagecache_get_page+0x18/0x60
__get_meta_page+0x95/0x460 [f2fs]
get_checkpoint_version+0x2a/0x1e0 [f2fs]
validate_checkpoint+0x8e/0x2a0 [f2fs]
f2fs_get_valid_checkpoint+0xd0/0x620 [f2fs]
f2fs_fill_super+0xc01/0x1d40 [f2fs]
mount_bdev+0x18a/0x1c0
f2fs_mount+0x15/0x20 [f2fs]
legacy_get_tree+0x28/0x50
vfs_get_tree+0x27/0xc0
path_mount+0x480/0xaa0
do_mount+0x7c/0xa0
__x64_sys_mount+0x8b/0xe0
do_syscall_64+0x38/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is cp_pack_total_block_count field in checkpoint was fuzzed
to one, as calcuated, two cp pack block locates in the same block address,
so then read latter cp pack block, it will block on the page lock due to
the lock has already held when reading previous cp pack block, fix it by
adding sanity check for cp_pack_total_block_count.
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Changed a way of showing values of them to use strings.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to
toggle whether or not the client will engage in actively discovery
of trunking locations.
v2 make notrunkdiscovery default
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 1976b2b314 ("NFSv4.1 query for fs_location attr on a new file system")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reset the last_readdir at the same time, and add a comment explaining
why we don't free last_readdir when dir_emit returns false.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If read_mapping_folio() fails then "inline_version" is printed without
being initialized.
[ jlayton: use CEPH_INLINE_NONE instead of "-1" ]
Fixes: 083db6fd3e ("ceph: uninline the data on a file opened for writing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make the math a bit simpler to understand (should not
affect execution speeds).
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Latencies are of type ktime_t, coverting from jiffies is incorrect.
Also, switch to "struct ceph_timespec" for r/w/m latencies.
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_find_inode() may will fail and return NULL.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_get_inode() will search for or insert a new inode into the
hash for the given vino, and return a reference to it. If new is
non-NULL, its reference is consumed.
We should release the reference when in error handing cases.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Without a full memory barrier between the store to the flags and the
load of the SQ tail the two operations can be reordered and this can
lead to a situation where the SQPOLL thread goes to sleep while the
application writes to the SQ tail and doesn't see the wakeup flag.
This memory barrier pairs with a full memory barrier in the application
between its store to the SQ tail and its load of the flags.
Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220321090059.46313-1-almogkh@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that we call fsnotify_modify() if we write a file, and that we
do fsnotify_access() if we read it. This enables anyone using inotify
on the file to get notified.
Ditto for fallocate, ensure that fsnotify_modify() is called.
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While the original code is valid, it is not the obvious choice for the
sizeof() call and in preparation to limit the scope of the list iterator
variable the sizeof should be changed to the size of the destination.
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Use netif_is_bridge_port defined in <linux/netdevice.h> instead of
open-coding it.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Steve French <stfrench@microsoft.com>
I've been chasing a recent resurgence in generic/388 recovery
failure and/or corruption events. The events have largely been
uninitialised inode chunks being tripped over in log recovery
such as:
XFS (pmem1): User initiated shutdown received.
pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem.
XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
XFS (pmem1): Unmounting Filesystem
XFS (pmem1): Mounting V5 Filesystem
XFS (pmem1): Starting recovery (logdev: internal)
XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): First 128 bytes of corrupted metadata buffer:
00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
XFS (pmem1): log mount/recovery failed: error -117
XFS (pmem1): log mount failed
There have been isolated random other issues, too - xfs_repair fails
because it finds some corruption in symlink blocks, rmap
inconsistencies, etc - but they are nowhere near as common as the
uninitialised inode chunk failure.
The problem has clearly happened at runtime before recovery has run;
I can see the ICREATE log item in the log shortly before the
actively recovered range of the log. This means the ICREATE was
definitely created and written to the log, but for some reason the
tail of the log has been moved past the ordered buffer log item that
tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
log moving past the ICREATE log item before the inode chunk buffer
is written to disk.
Tracing the fsstress processes that are running when the filesystem
shut down immediately pin-pointed the problem:
user shutdown marks xfs_mount as shutdown
godown-213341 [008] 6398.022871: console: [ 6397.915392] XFS (pmem1): User initiated shutdown received.
.....
aild tries to push ordered inode cluster buffer
xfsaild/pmem1-213314 [001] 6398.022974: xfs_buf_trylock: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
xfsaild/pmem1-213314 [001] 6398.022976: xfs_ilock_nowait: dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
calls xfs_iflush_abort() to kill writeback of the inode.
Inode is removed from AIL, drops cluster buffer reference.
xfsaild/pmem1-213314 [001] 6398.022977: xfs_ail_delete: dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
xfsaild/pmem1-213314 [001] 6398.022978: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
.....
All inodes on cluster buffer are aborted, then the cluster buffer
itself is aborted and removed from the AIL *without writeback*:
xfsaild/pmem1-213314 [001] 6398.023011: xfs_buf_error_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
xfsaild/pmem1-213314 [001] 6398.023012: xfs_ail_delete: dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
The inode buffer was at 7/20344 when it was removed from the AIL.
xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_item_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
.....
Userspace is still running, doing stuff. an fsstress process runs
syncfs() or sync() and we end up in sync_fs_one_sb() which issues
a log force. This pushes on the CIL:
fsstress-213322 [001] 6398.024430: xfs_fs_sync_fs: dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
<...>-194402 [001] 6398.024467: kmem_alloc: size 176 flags 0x14 caller xlog_cil_push_work+0x9f
And the CIL fills up iclogs with pending changes. This picks up
the current tail from the AIL:
<...>-194402 [001] 6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags caller xlog_write+0x149
<...>-194402 [001] 6398.024498: xlog_iclog_switch: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags caller xlog_state_get_iclog_space+0x37e
<...>-194402 [001] 6398.024521: xlog_iclog_release: dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags caller xlog_write+0x5f9
<...>-194402 [001] 6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
And it moves the tail of the log to 7/21000 from 7/20344. This
*moves the tail of the log beyond the ICREATE transaction* that was
at 7/20344 and pinned by the inode cluster buffer that was cancelled
above.
....
godown-213341 [008] 6398.027005: xfs_force_shutdown: dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
godown-213341 [008] 6398.027022: console: [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
godown-213341 [008] 6398.030551: console: [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
And finally the log itself is now shutdown, stopping all further
writes to the log. But this is too late to prevent the corruption
that moving the tail of the log forwards after we start cancelling
writeback causes.
The fundamental problem here is that we are using the wrong shutdown
checks for log items. We've long conflated mount shutdown with log
shutdown state, and I started separating that recently with the
atomic shutdown state changes in commit b36d4651e1 ("xfs: make
forced shutdown processing atomic"). The changes in that commit
series are directly responsible for being able to diagnose this
issue because it clearly separated mount shutdown from log shutdown.
Essentially, once we start cancelling writeback of log items and
removing them from the AIL because the filesystem is shut down, we
*cannot* update the journal because we may have cancelled the items
that pin the tail of the log. That moves the tail of the log
forwards without having written the metadata back, hence we have
corrupt in memory state and writing to the journal propagates that
to the on-disk state.
What commit b36d4651e1 makes clear is that log item state needs to
change relative to log shutdown, not mount shutdown. IOWs, anything
that aborts metadata writeback needs to check log shutdown state
because log items directly affect log consistency. Having them check
mount shutdown state introduces the above race condition where we
cancel metadata writeback before the log shuts down.
To fix this, this patch works through all log items and converts
shutdown checks to use xlog_is_shutdown() rather than
xfs_is_shutdown(), so that we don't start aborting metadata
writeback before we shut off journal writes.
AFAICT, this race condition is a zero day IO error handling bug in
XFS that dates back to the introduction of XLOG_IO_ERROR,
XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The AIL operates purely on log items, so it is a log centric
subsystem. Divorce it from the xfs_mount and instead have it pass
around xlog pointers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Log items belong to the log, not the xfs_mount. Convert the mount
pointer in the log item to a xlog pointer in preparation for
upcoming log centric changes to the log items.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When the AIL tries to flush the CIL, it relies on the CIL push
ending up on stable storage without having to wait for and
manipulate iclog state directly. However, if there is already a
pending CIL push when the AIL tries to flush the CIL, it won't set
the cil->xc_push_commit_stable flag and so the CIL push will not
actively flush the commit record iclog.
generic/530 when run on a single CPU test VM can trigger this fairly
reliably. This test exercises unlinked inode recovery, and can
result in inodes being pinned in memory by ongoing modifications to
the inode cluster buffer to record unlinked list modifications. As a
result, the first inode unlinked in a buffer can pin the tail of the
log whilst the inode cluster buffer is pinned by the current
checkpoint that has been pushed but isn't on stable storage because
because the cil->xc_push_commit_stable was not set. This results in
the log/AIL effectively deadlocking until something triggers the
commit record iclog to be pushed to stable storage (i.e. the
periodic log worker calling xfs_log_force()).
The fix is two-fold - first we should always set the
cil->xc_push_commit_stable when xlog_cil_flush() is called,
regardless of whether there is already a pending push or not.
Second, if the CIL is empty, we should trigger an iclog flush to
ensure that the iclogs of the last checkpoint have actually been
submitted to disk as that checkpoint may not have been run under
stable completion constraints.
Reported-and-tested-by: Matthew Wilcox <willy@infradead.org>
Fixes: 0020a190cf ("xfs: AIL needs asynchronous CIL forcing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_ail_push_all_sync() has a loop like this:
while max_ail_lsn {
prepare_to_wait(ail_empty)
target = max_ail_lsn
wake_up(ail_task);
schedule()
}
Which is designed to sleep until the AIL is emptied. When
xfs_ail_update_finish() moves the tail of the log, it does:
if (list_empty(&ailp->ail_head))
wake_up_all(&ailp->ail_empty);
So it will only wake up the sync push waiter when the AIL goes
empty. If, by the time the push waiter has woken, the AIL has more
in it, it will reset the target, wake the push task and go back to
sleep.
The problem here is that if the AIL is having items added to it
when xfs_ail_push_all_sync() is called, then they may get inserted
into the AIL at a LSN higher than the target LSN. At this point,
xfsaild_push() will see that the target is X, the item LSNs are
(X+N) and skip over them, hence never pushing the out.
The result of this the AIL will not get emptied by the AIL push
thread, hence xfs_ail_finish_update() will never see the AIL being
empty even if it moves the tail. Hence xfs_ail_push_all_sync() never
gets woken and hence cannot update the push target to capture the
items beyond the current target on the LSN.
This is a TOCTOU type of issue so the way to avoid it is to not
use the push target at all for sync pushes. We know that a sync push
is being requested by the fact the ail_empty wait queue is active,
hence the xfsaild can just set the target to max_ail_lsn on every
push that we see the wait queue active. Hence we no longer will
leave items on the AIL that are beyond the LSN sampled at the start
of a sync push.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
AIL flushing can get stuck here:
[316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
[316649.007807] Not tainted 5.17.0-rc6-dgc+ #975
[316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316649.011720] task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000
[316649.014112] Call Trace:
[316649.014841] <TASK>
[316649.015492] __schedule+0x30d/0x9e0
[316649.017745] schedule+0x55/0xd0
[316649.018681] io_schedule+0x4b/0x80
[316649.019683] xfs_buf_wait_unpin+0x9e/0xf0
[316649.021850] __xfs_buf_submit+0x14a/0x230
[316649.023033] xfs_buf_delwri_submit_buffers+0x107/0x280
[316649.024511] xfs_buf_delwri_submit_nowait+0x10/0x20
[316649.025931] xfsaild+0x27e/0x9d0
[316649.028283] kthread+0xf6/0x120
[316649.030602] ret_from_fork+0x1f/0x30
in the situation where flushing gets preempted between the unpin
check and the buffer trylock under nowait conditions:
blk_start_plug(&plug);
list_for_each_entry_safe(bp, n, buffer_list, b_list) {
if (!wait_list) {
if (xfs_buf_ispinned(bp)) {
pinned++;
continue;
}
Here >>>>>>
if (!xfs_buf_trylock(bp))
continue;
This means submission is stuck until something else triggers a log
force to unpin the buffer.
To get onto the delwri list to begin with, the buffer pin state has
already been checked, and hence it's relatively rare we get a race
between flushing and encountering a pinned buffer in delwri
submission to begin with. Further, to increase the pin count the
buffer has to be locked, so the only way we can hit this race
without failing the trylock is to be preempted between the pincount
check seeing zero and the trylock being run.
Hence to avoid this problem, just invert the order of trylock vs
pin check. We shouldn't hit that many pinned buffers here, so
optimising away the trylock for pinned buffers should not matter for
performance at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
After 963 iterations of generic/530, it deadlocked during recovery
on a pinned inode cluster buffer like so:
XFS (pmem1): Starting recovery (logdev: internal)
INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
Not tainted 5.17.0-rc6-dgc+ #975
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:0 state:D stack:13024 pid:306037 ppid: 2 flags:0x00004000
Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
Call Trace:
<TASK>
__schedule+0x30d/0x9e0
schedule+0x55/0xd0
schedule_timeout+0x114/0x160
__down+0x99/0xf0
down+0x5e/0x70
xfs_buf_lock+0x36/0xf0
xfs_buf_find+0x418/0x850
xfs_buf_get_map+0x47/0x380
xfs_buf_read_map+0x54/0x240
xfs_trans_read_buf_map+0x1bd/0x490
xfs_imap_to_bp+0x4f/0x70
xfs_iunlink_map_ino+0x66/0xd0
xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
xfs_iunlink_remove_inode+0xf2/0x1d0
xfs_inactive_ifree+0x1a3/0x900
xfs_inode_unlink+0xcc/0x210
xfs_inodegc_worker+0x1ac/0x2f0
process_one_work+0x1ac/0x390
worker_thread+0x56/0x3c0
kthread+0xf6/0x120
ret_from_fork+0x1f/0x30
</TASK>
task:mount state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x30d/0x9e0
schedule+0x55/0xd0
schedule_timeout+0x114/0x160
__down+0x99/0xf0
down+0x5e/0x70
xfs_buf_lock+0x36/0xf0
xfs_buf_find+0x418/0x850
xfs_buf_get_map+0x47/0x380
xfs_buf_read_map+0x54/0x240
xfs_trans_read_buf_map+0x1bd/0x490
xfs_imap_to_bp+0x4f/0x70
xfs_iget+0x300/0xb40
xlog_recover_process_one_iunlink+0x4c/0x170
xlog_recover_process_iunlinks.isra.0+0xee/0x130
xlog_recover_finish+0x57/0x110
xfs_log_mount_finish+0xfc/0x1e0
xfs_mountfs+0x540/0x910
xfs_fs_fill_super+0x495/0x850
get_tree_bdev+0x171/0x270
xfs_fs_get_tree+0x15/0x20
vfs_get_tree+0x24/0xc0
path_mount+0x304/0xba0
__x64_sys_mount+0x108/0x140
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x30d/0x9e0
schedule+0x55/0xd0
io_schedule+0x4b/0x80
xfs_buf_wait_unpin+0x9e/0xf0
__xfs_buf_submit+0x14a/0x230
xfs_buf_delwri_submit_buffers+0x107/0x280
xfs_buf_delwri_submit_nowait+0x10/0x20
xfsaild+0x27e/0x9d0
kthread+0xf6/0x120
ret_from_fork+0x1f/0x30
We have the mount process waiting on an inode cluster buffer read,
inodegc doing unlink waiting on the same inode cluster buffer, and
the AIL push thread blocked in writeback waiting for the inode
cluster buffer to become unpinned.
What has happened here is that the AIL push thread has raced with
the inodegc process modifying, committing and pinning the inode
cluster buffer here in xfs_buf_delwri_submit_buffers() here:
blk_start_plug(&plug);
list_for_each_entry_safe(bp, n, buffer_list, b_list) {
if (!wait_list) {
if (xfs_buf_ispinned(bp)) {
pinned++;
continue;
}
Here >>>>>>
if (!xfs_buf_trylock(bp))
continue;
Basically, the AIL has found the buffer wasn't pinned and got the
lock without blocking, but then the buffer was pinned. This implies
the processing here was pre-empted between the pin check and the
lock, because the pin count can only be increased while holding the
buffer locked. Hence when it has gone to submit the IO, it has
blocked waiting for the buffer to be unpinned.
With all executing threads now waiting on the buffer to be unpinned,
we normally get out of situations like this via the background log
worker issuing a log force which will unpinned stuck buffers like
this. But at this point in recovery, we haven't started the log
worker. In fact, the first thing we do after processing intents and
unlinked inodes is *start the log worker*. IOWs, we start it too
late to have it break deadlocks like this.
Avoid this and any other similar deadlock vectors in intent and
unlinked inode recovery by starting the log worker before we recover
intents and unlinked inodes. This part of recovery runs as though
the filesystem is fully active, so we really should have the same
infrastructure running as we normally do at runtime.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
We currently have a race where we recycle the selected buffer if poll
returns IO_APOLL_OK. But that's too late, as the poll could already be
triggering or have triggered. If that race happens, then we're putting a
buffer that's already being used.
Fix this by recycling before we arm poll. This does mean that we'll
sometimes almost instantly re-select the buffer, but it's rare enough in
testing that it should not pose a performance issue.
Fixes: b1c6264575 ("io_uring: recycle provided buffers if request goes async")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The cifs_demultiplexer_thread should only call cifs_reconnect.
If any other thread wants to trigger a reconnect, they can do
so by updating the server tcpStatus to CifsNeedReconnect.
The last patch attempted to use the same helper function for
both types of threads, but that causes other issues
with lock dependencies.
This patch creates a new helper for non-cifsd threads, that
will indicate to cifsd that the server needs reconnect.
Fixes: 2a05137a05 ("cifs: mark sessions for reconnection in helper function")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove the spinlock around the tree traversal as we are calling possibly
sleeping functions.
We do not need a spinlock here as there will be no modifications to this
tree at this point.
This prevents warnings like this to occur in dmesg:
[ 653.774996] BUG: sleeping function called from invalid context at kernel/loc\
king/mutex.c:280
[ 653.775088] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1827, nam\
e: umount
[ 653.775152] preempt_count: 1, expected: 0
[ 653.775191] CPU: 0 PID: 1827 Comm: umount Tainted: G W OE 5.17.0\
-rc7-00006-g4eb628dd74df #135
[ 653.775195] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-\
1.fc33 04/01/2014
[ 653.775197] Call Trace:
[ 653.775199] <TASK>
[ 653.775202] dump_stack_lvl+0x34/0x44
[ 653.775209] __might_resched.cold+0x13f/0x172
[ 653.775213] mutex_lock+0x75/0xf0
[ 653.775217] ? __mutex_lock_slowpath+0x10/0x10
[ 653.775220] ? _raw_write_lock_irq+0xd0/0xd0
[ 653.775224] ? dput+0x6b/0x360
[ 653.775228] cifs_kill_sb+0xff/0x1d0 [cifs]
[ 653.775285] deactivate_locked_super+0x85/0x130
[ 653.775289] cleanup_mnt+0x32c/0x4d0
[ 653.775292] ? path_umount+0x228/0x380
[ 653.775296] task_work_run+0xd8/0x180
[ 653.775301] exit_to_user_mode_loop+0x152/0x160
[ 653.775306] exit_to_user_mode_prepare+0x89/0xd0
[ 653.775315] syscall_exit_to_user_mode+0x12/0x30
[ 653.775322] do_syscall_64+0x48/0x90
[ 653.775326] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 187af6e98b44e5d8f25e1d41a92db138eb54416f ("cifs: fix handlecache and multiuser")
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When session gets reconnected during mount then read size in super block fs context
gets set to zero and after negotiate, rsize is not modified which results in
incorrect read with requested bytes as zero. Fixes intermittent failure
of xfstest generic/240
Note that stable requires a different version of this patch which will be
sent to the stable mailing list.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
RHBZ:1997367
When we collapse a range in smb3_collapse_range() we must make sure
we update the inode size and pagecache accordingly.
If not, both inode size and pagecahce may be stale until it is refreshed.
This can be demonstrated for the inode size by running :
xfs_io -i -f -c "truncate 320k" -c "fcollapse 64k 128k" -c "fiemap -v" \
/mnt/testfile
where we can see the result of stale data in the fiemap output.
The third line of the output is wrong, all this data should be truncated.
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: hole 128
1: [128..383]: 128..383 256 0x1
2: [384..639]: hole 256
And the correct output, when the inode size has been updated correctly should
look like this:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: hole 128
1: [128..383]: 128..383 256 0x1
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In multiuser each individual user has their own tcon structure for the
share and thus their own handle for a cached directory.
When we umount such a share we much make sure to release the pinned down dentry
for each such tcon and not just the master tcon.
Otherwise we will get nasty warnings on umount that dentries are still in use:
[ 3459.590047] BUG: Dentry 00000000115c6f41{i=12000000019d95,n=/} still in use\
(2) [unmount of cifs cifs]
...
[ 3459.590492] Call Trace:
[ 3459.590500] d_walk+0x61/0x2a0
[ 3459.590518] ? shrink_lock_dentry.part.0+0xe0/0xe0
[ 3459.590526] shrink_dcache_for_umount+0x49/0x110
[ 3459.590535] generic_shutdown_super+0x1a/0x110
[ 3459.590542] kill_anon_super+0x14/0x30
[ 3459.590549] cifs_kill_sb+0xf5/0x104 [cifs]
[ 3459.590773] deactivate_locked_super+0x36/0xa0
[ 3459.590782] cleanup_mnt+0x131/0x190
[ 3459.590789] task_work_run+0x5c/0x90
[ 3459.590798] exit_to_user_mode_loop+0x151/0x160
[ 3459.590809] exit_to_user_mode_prepare+0x83/0xd0
[ 3459.590818] syscall_exit_to_user_mode+0x12/0x30
[ 3459.590828] do_syscall_64+0x48/0x90
[ 3459.590833] entry_SYSCALL_64_after_hwframe+0x44/0xae
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmI0tLYACgkQiiy9cAdy
T1GllAv/W+4jMNZguBeAfCxSSuIjYEJI4Pz3QvBMto5gb4rgTEJrOLyz5n+Zj2+/
VpLnTt0GNOcKdstfO2bql3P3m+gWF6UdiJA8/LQ6vR4fLJIGyX6VcxtDXhI3+PaA
FoIRrNAEhY3knOaz8gCEVA18+F2QPGdmfFdMUziPf2vTqXtqsPXAUKJLQCwMWHXe
8aWTbfvhFR1b+xeZy3Rb9rOjzd1qRIgSZYDfpmkLHejckhSMyR5i0NiwikTxIA2A
IpHA48b/jfvOOyg7ABjZrAIumNZxZE7mYSjMGNhuEdiR17oVwtgsXn8c8DLLtNkY
+95dKXuuGIPiGGTRk2OnNotHdC0GOj5VT/Z4akSnHGZHGEBoNe35QkUPWFpusc8L
Wjt2hzcNErYnFo2Qi4QTI2i7NTFoTig0UYPd1KUnsvjeLQWo/DtQluu+FR7GEl1F
MDQDyaDgcm10nTDy3xA8zGQRdJORKM0Qzq9dTZs/sWUUN7trynk0KTCT8CgsGqH1
3jHXYrRR
=rhQe
-----END PGP SIGNATURE-----
Merge tag '5.17-rc8-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"Small fix for regression in multiuser mounts.
The additional improvements suggested by Ronnie to make the server and
session status handling code easier to read can wait for the 5.18
merge window."
* tag '5.17-rc8-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix incorrect session setup check for multiuser mounts
The fix for not advancing the iterator if we're using fixed buffers is
broken in that it can hit a condition where we don't terminate the loop.
This results in io-wq looping forever, asking to read (or write) 0 bytes
for every subsequent loop.
Reported-by: Joel Jaeschke <joel.jaeschke@gmail.com>
Link: https://github.com/axboe/liburing/issues/549
Fixes: 16c8d2df7e ("io_uring: ensure symmetry in handling iter types in loop_rw_iter()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In fill_thread_core_info() the ptrace accessible registers are collected
to be written out as notes in a core file. The note array is allocated
from a size calculated by iterating the user regset view, and counting the
regsets that have a non-zero core_note_type. However, this only allows for
there to be non-zero core_note_type at the end of the regset view. If
there are any gaps in the middle, fill_thread_core_info() will overflow the
note allocation, as it iterates over the size of the view and the
allocation would be smaller than that.
There doesn't appear to be any arch that has gaps such that they exceed
the notes allocation, but the code is brittle and tries to support
something it doesn't. It could be fixed by increasing the allocation size,
but instead just have the note collecting code utilize the array better.
This way the allocation can stay smaller.
Even in the case of no arch's that have gaps in their regset views, this
introduces a change in the resulting indicies of t->notes. It does not
introduce any changes to the core file itself, because any blank notes are
skipped in write_note_info().
In case, the allocation logic between fill_note_info() and
fill_thread_core_info() ever diverges from the usage logic, warn and skip
writing any notes that would overflow the array.
This fix is derrived from an earlier one[0] by Yu-cheng Yu.
[0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com
Looks like a victim of too much copy/paste, we should not be looking
at req->open.how in accept. The point is to check CLOEXEC and error
out, which we don't invalid direct descriptors on exec. Hence any
attempt to get a direct descriptor with CLOEXEC is invalid.
No harm is done here, as req->open.how.flags overlaps with
req->accept.flags, but it's very confusing and might change if either of
those command structs are modified.
Fixes: aaa4db12ef ("io_uring: accept directly into fixed file table")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let's enable GC_URGENT_HIGH mode during f2fs_disable_checkpoint(),
so that we can use SSR allocator for GCed data/node persistence,
it can improve the performance due to it avoiding migration of
data/node locates in selected target segment of SSR allocator.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When compressed file has blocks, f2fs_ioc_start_atomic_write will succeed,
but compressed flag will be remained in inode. If write partial compreseed
cluster and commit atomic write will cause data corruption.
This is the reproduction process:
Step 1:
create a compressed file ,write 64K data , call fsync(), then the blocks
are write as compressed cluster.
Step2:
iotcl(F2FS_IOC_START_ATOMIC_WRITE) --- this should be fail, but not.
write page 0 and page 3.
iotcl(F2FS_IOC_COMMIT_ATOMIC_WRITE) -- page 0 and 3 write as normal file,
Step3:
drop cache.
read page 0-4 -- Since page 0 has a valid block address, read as
non-compressed cluster, page 1 and 2 will be filled with compressed data
or zero.
The root cause is, after commit 7eab7a6968 ("f2fs: compress: remove
unneeded read when rewrite whole cluster"), in step 2, f2fs_write_begin()
only set target page dirty, and in f2fs_commit_inmem_pages(), we will write
partial raw pages into compressed cluster, result in corrupting compressed
cluster layout.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Fixes: 7eab7a6968 ("f2fs: compress: remove unneeded read when rewrite whole cluster")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to initialized sbi->gc_mode to GC_NORMAL explicitly.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When compiling with -Wformat, clang emits the following warnings:
fs/nfsd/flexfilelayout.c:120:27: warning: format specifies type 'unsigned
char' but the argument has type 'int' [-Wformat]
"%s.%hhu.%hhu", addr, port >> 8, port & 0xff);
~~~~ ^~~~~~~~~
%d
fs/nfsd/flexfilelayout.c:120:38: warning: format specifies type 'unsigned
char' but the argument has type 'int' [-Wformat]
"%s.%hhu.%hhu", addr, port >> 8, port & 0xff);
~~~~ ^~~~~~~~~~~
%d
The types of these arguments are unconditionally defined, so this patch
updates the format character to the correct ones for ints and unsigned
ints.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Bill Wendling <morbo@google.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Haynes <loghyr@hammerspace.com>
Workloads using provided buffers benefit from using and returning buffers
in the right order, and so does TLBs for that matter. Manage the internal
buffer list in a straight list, rather than use the head buffer as the
insertion node. Use a hashed list for the buffer group IDs instead of
xarray, the overhead is much lower this way. xarray provides internal
locking and other trickery that is handy for some uses cases, but
io_uring already locks internally for the buffer manipulation and needs
none of that.
This is good for about a 2% reduction in overhead, combination of the
improved management and the fact that the workload has an easier time
bundling back provided buffers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When renaming the whiteout file, the old whiteout file is not deleted.
Therefore, we add the old dentry size to the old dir like XFS.
Otherwise, an error may be reported due to `fscki->calc_sz != fscki->size`
in check_indes.
Fixes: 9e0a1fff8d ("ubifs: Implement RENAME_WHITEOUT")
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Once s_root is set, genric_shutdown_super() will be called if
fill_super() fails. That means, we will call ocfs2_dismount_volume()
twice in such case, which can lead to kernel crash.
Fix this issue by initializing filecheck kobj before setting s_root.
Link: https://lkml.kernel.org/r/20220310081930.86305-1-joseph.qi@linux.alibaba.com
Fixes: 5f483c4abb ("ocfs2: add kobject for online file check")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need a mid level of gc urgent mode to do GC forcibly in a period
of given gc_urgent_sleep_time, but not like using greedy GC approach
and switching to SSR mode such as gc urgent high mode. This can be
used for more aggressive periodic storage clean up.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In lz4_decompress_pages(), if size of decompressed data is not equal to
expected one, we should print the size rather than size of target buffer
for decompressed data, fix it.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
iput() has already judged the incoming parameter, so there is
no need to repeat the judgment here.
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
EROFS images should inherit modification time rather than change time,
since users and host tooling have no easy way to control change time.
To reflect the new timestamp meaning, i_ctime and i_ctime_nsec are
renamed to i_mtime and i_mtime_nsec.
Link: https://lore.kernel.org/r/20220311041829.3109511-1-dvander@google.com # v1
Signed-off-by: David Anderson <dvander@google.com>
[ Gao Xiang: update document as well. ]
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220317114959.106787-1-hsiangkao@linux.alibaba.com # v2
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
A recent change to how the SMB3 server (socket) and session status
is managed regressed multiuser mounts by changing the check
for whether session setup is needed to the socket (TCP_Server_info)
structure instead of the session struct (cifs_ses). Add additional
check in cifs_setup_sesion to fix this.
Fixes: 73f9bfbe3d ("cifs: maintain a state machine for tcp/smb/tcon sessions")
Reported-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add ->has_evfd flag, which is true IFF there is an eventfd attached, and
use it to hide io_eventfd_signal() into __io_commit_cqring_flush() and
combine fast checks in a single if. Also, gcc 11.2 wasn't inlining
io_cqring_ev_posted() without this change, so helps with that as well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f6168471997decded475a063f92915787975a30b.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_commit_cqring() is currently always under spinlock section, so it's
always better to keep it as slim as possible. Move
__io_commit_cqring_flush() out of it into ev_posted*(). If fast checks
do fail and this post-processing is required, we'll reacquire
->completion_lock, which is fine as we don't care about performance of
draining and offset timeouts.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec4e81fd720d3bc7bca8cb9152e080dad1a052f1.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
smp_mb() in io_cqring_ev_posted_iopoll() is only there because of
waitqueue_active(). However, non-SQPOLL IOPOLL ring doesn't wake the CQ
and so the barrier there is useless. Kill it, it's usually pretty
expensive.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d72e8ef6f7a3f6a72e18fad8409f7d47afc8da7d.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's never a good idea to put provided buffers without notifying the
userspace, it'll lead to userspace leaks, so add io_put_kbuf() in
io_req_complete_failed(). The fail helper is called by all sorts of
requests, but it's still safe to do as io_put_kbuf() will return 0 in
for all requests that don't support and so don't expect provided buffers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_fill_cqe*() is not always the best way to post CQEs just because
there is enough of infrastructure on top. Replace a raw call to a
variant of it inside of io_timeout_cancel(), which also saves us some
bloating and might help with batching later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46113ec4345764b4aef3b384ce38cceabaeedcbb.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With commit "io_uring: cache req->apoll->events in req->cflags" applied,
we now have just io_poll_remove_entries() dipping into req->apoll when
it isn't strictly necessary.
Mark poll and double-poll with a flag, so we know if we need to look
at apoll->double_poll. This avoids pulling in those cachelines if we
don't need them. The common case is that the poll wake handler already
removed these entries while hot off the completion path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we arm poll on behalf of a different type of request, like a network
receive, then we allocate req->apoll as our poll entry. Running network
workloads shows io_poll_check_events() as the most expensive part of
io_uring, and it's all due to having to pull in req->apoll instead of
just the request which we have hot already.
Cache poll->events in req->cflags, which isn't used until the request
completes anyway. This isn't strictly needed for regular poll, where
req->poll.events is used and thus already hot, but for the sake of
unification we do it all around.
This saves 3-4% of overhead in certain request workloads.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we mount a jffs2 image, assume that the first few blocks of
the image are normal and contain at least one xattr-related inode,
but the next block is abnormal. As a result, an error is returned
in jffs2_scan_eraseblock(). jffs2_clear_xattr_subsystem() is then
called in jffs2_build_filesystem() and then again in
jffs2_do_fill_super().
Finally we can observe the following report:
==================================================================
BUG: KASAN: use-after-free in jffs2_clear_xattr_subsystem+0x95/0x6ac
Read of size 8 at addr ffff8881243384e0 by task mount/719
Call Trace:
dump_stack+0x115/0x16b
jffs2_clear_xattr_subsystem+0x95/0x6ac
jffs2_do_fill_super+0x84f/0xc30
jffs2_fill_super+0x2ea/0x4c0
mtd_get_sb+0x254/0x400
mtd_get_sb_by_nr+0x4f/0xd0
get_tree_mtd+0x498/0x840
jffs2_get_tree+0x25/0x30
vfs_get_tree+0x8d/0x2e0
path_mount+0x50f/0x1e50
do_mount+0x107/0x130
__se_sys_mount+0x1c5/0x2f0
__x64_sys_mount+0xc7/0x160
do_syscall_64+0x45/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Allocated by task 719:
kasan_save_stack+0x23/0x60
__kasan_kmalloc.constprop.0+0x10b/0x120
kasan_slab_alloc+0x12/0x20
kmem_cache_alloc+0x1c0/0x870
jffs2_alloc_xattr_ref+0x2f/0xa0
jffs2_scan_medium.cold+0x3713/0x4794
jffs2_do_mount_fs.cold+0xa7/0x2253
jffs2_do_fill_super+0x383/0xc30
jffs2_fill_super+0x2ea/0x4c0
[...]
Freed by task 719:
kmem_cache_free+0xcc/0x7b0
jffs2_free_xattr_ref+0x78/0x98
jffs2_clear_xattr_subsystem+0xa1/0x6ac
jffs2_do_mount_fs.cold+0x5e6/0x2253
jffs2_do_fill_super+0x383/0xc30
jffs2_fill_super+0x2ea/0x4c0
[...]
The buggy address belongs to the object at ffff8881243384b8
which belongs to the cache jffs2_xattr_ref of size 48
The buggy address is located 40 bytes inside of
48-byte region [ffff8881243384b8, ffff8881243384e8)
[...]
==================================================================
The triggering of the BUG is shown in the following stack:
-----------------------------------------------------------
jffs2_fill_super
jffs2_do_fill_super
jffs2_do_mount_fs
jffs2_build_filesystem
jffs2_scan_medium
jffs2_scan_eraseblock <--- ERROR
jffs2_clear_xattr_subsystem <--- free
jffs2_clear_xattr_subsystem <--- free again
-----------------------------------------------------------
An error is returned in jffs2_do_mount_fs(). If the error is returned
by jffs2_sum_init(), the jffs2_clear_xattr_subsystem() does not need to
be executed. If the error is returned by jffs2_build_filesystem(), the
jffs2_clear_xattr_subsystem() also does not need to be executed again.
So move jffs2_clear_xattr_subsystem() from 'out_inohash' to 'out_root'
to fix this UAF problem.
Fixes: aa98d7cf59 ("[JFFS2][XATTR] XATTR support on JFFS2 (version. 5)")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
comments still mentioning i_mutex.
Signed-off-by: hongnanli <hongnan.li@linux.alibaba.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This serves two purposes:
- We now have the last cacheline mostly unused for generic workloads,
instead of having to pull in the poll refs explicitly for workloads
that rely on poll arming.
- It shrinks the io_kiocb from 232 to 224 bytes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With all implementations converted to ->dirty_folio, we can stop calling
this fallback method and remove it entirely.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a mechanical change.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Convert all callers; mostly this is just changing the aops to point
at it, but a few implementations need a little more work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
The comment about the page always being locked is wrong, so copy
the locking protection from __set_page_dirty_buffers(). That
means moving the call to nilfs_set_file_dirty() down the
function so as to not acquire a new dependency between the
mapping->private_lock and the ns_inode_lock. That might be a
harmless dependency to add, but it's not necessary.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This converts the remaining inode lookup part by using metabuf in a
straight-forward way. Except that it doesn't use kmap_atomic()
anymore since we now have to maintain two metabufs together.
After this patch, all uncompressed paths are handled with metabuf
instead of page structure.
Link: https://lore.kernel.org/r/20220316012246.95131-2-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, directory inodes are directly handled with page cache
interfaces.
In order to support sub-page directory blocks and folios, let's
convert them into the latest metabuf infrastructure as well and
this patch addresses the readdir case first.
Link: https://lore.kernel.org/r/20220316012246.95131-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Syzkaller hit 'WARNING: kobject bug in erofs_unregister_sysfs'. This bug
is triggered by injecting fault in kobject_init_and_add of
erofs_unregister_sysfs.
Fix this by adding sanity check for kobject in erofs_unregister_sysfs
Note that I've tested the patch and the crash does not occur any more.
Link: https://lore.kernel.org/r/20220315132814.12332-1-dzm91@hust.edu.cn
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Fixes: 168e9a7620 ("erofs: add sysfs interface")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Avoid the unnecessary tail recursion since it can be converted into
a loop directly in order to prevent potential stack overflow.
It's a pretty straightforward conversion.
Link: https://lore.kernel.org/r/20220310182743.102365-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Dan reported two smatch warnings [1],
.. warn: should '1 << lclusterbits' be a 64 bit type?
.. warn: should 'm->compressedlcs << lclusterbits' be a 64 bit type?
In practice, m_plen cannot be more than 1MiB due to on-disk constraint
for the compression mode, so we're always safe here.
In order to make static analyzers happy and not report again, let's
silence them instead.
[1] https://lore.kernel.org/r/202203091002.lJVzsX6e-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220310173448.19962-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Rename preload_compressed_pages() as z_erofs_bind_cache()
since we're trying to prepare for adapting folios.
Also, add a comment for the gfp setting. No logic changes.
Link: https://lore.kernel.org/r/20220301194951.106227-2-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Avoid `struct z_erofs_collector' since there is another context
structure called "struct z_erofs_decompress_frontend".
No logic changes.
Link: https://lore.kernel.org/r/20220301194951.106227-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The only change is that, meta buffers read cache page without __GFP_FS
flag, which shall not matter.
Link: https://lore.kernel.org/r/20220209060108.43051-7-jefflexu@linux.alibaba.com
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Commit 84c4e1f89f ("aio: simplify - and fix - fget/fput for io_submit()")
refactored aio_read() and some error cases into early return, which made
some intermediate assignment of the return variable needless.
Drop this needless assignment in aio_read().
No functional change. No change in resulting object code.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
separate the "cleanup" and "apply" codepaths (they have almost no overlap),
fold the "cleanup" into "prepare" (which eliminates the need of ->revert)
and make loops more idiomatic.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds commit_tid info in ext4_fc_commit_start/stop which is helpful
in debugging fast_commit issues.
For e.g. issues where due to jbd2 journal full commit, FC miss to commit
updates to a file.
Also improves TP_prink format string i.e. all ext4 and jbd2 trace events
starts with "dev MAjOR,MINOR". Let's follow the same convention while we
are still at it.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/ebcd6b9ab5b718db30f90854497886801ce38c63.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This adds commit_tid argument in ext4_fc_update_stats()
so that we can add this information too in jbd_debug logs.
This is also required in a later patch to pass the commit_tid info in
ext4_fc_commit_start/stop() trace events.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/dabda3f2919a60e01887e798bf5915216b451733.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds the transaction & inode tid info in trace events for
callers of ext4_fc_track_template(). This is helpful in debugging race
conditions where an inode could belong to two different transaction tids.
It also fixes the checkpatch warnings which says use tabs instead of
spaces.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c203c09dc11bb372803c430f621f25a4b8c2c8b4.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently ext4_fc_track_template() checks, whether the trace event
path belongs to replay or does sb has ineligible set, if yes it simply
returns. This patch pulls those checks before calling
ext4_fc_track_template() in the callers of ext4_fc_track_template().
[ Add checks to ext4_rename() which calls the __ext4_fc_track_*()
functions directly. -- TYT ]
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/3cd025d9c490218a92e6d8fb30b6123e693373e3.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
pstore_dump() is *always* invoked in atomic context (nowadays in an RCU
read-side critical section, before that under a spinlock).
It doesn't make sense to try to use semaphores here.
This is mostly a revert of commit ea84b580b9 ("pstore: Convert buf_lock
to semaphore"), except that two parts aren't restored back exactly as they
were:
- keep the lock initialization in pstore_register
- in efi_pstore_write(), always set the "block" flag to false
- omit "is_locked", that was unnecessary since
commit 959217c84c ("pstore: Actually give up during locking failure")
- fix the bailout message
The actual problem that the buggy commit was trying to address may have
been that the use of preemptible() in efi_pstore_write() was wrong - it
only looks at preempt_count() and the state of IRQs, but __rcu_read_lock()
doesn't touch either of those under CONFIG_PREEMPT_RCU.
(Sidenote: CONFIG_PREEMPT_RCU means that the scheduler can preempt tasks in
RCU read-side critical sections, but you're not allowed to actively
block/reschedule.)
Lockdep probably never caught the problem because it's very rare that you
actually hit the contended case, so lockdep always just sees the
down_trylock(), not the down_interruptible(), and so it can't tell that
there's a problem.
Fixes: ea84b580b9 ("pstore: Convert buf_lock to semaphore")
Cc: stable@vger.kernel.org
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220314185953.2068993-1-jannh@google.com
Particularly for networked workloads, io_uring intensively uses its
poll based backend to get a notification when data/space is available.
Profiling workloads, we see 3-4% of alloc+free that is directly attributed
to just the apoll allocation and free (and the rest being skb alloc+free).
For the fast path, we have ctx->uring_lock held already for both issue
and the inline completions, and we can utilize that to avoid any extra
locking needed to have a basic recycling cache for the apoll entries on
both the alloc and free side.
Double poll still requires an allocation. But those are rare and not
a fast path item.
With the simple cache in place, we see a 3-4% reduction in overhead for
the workload.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Smatch complains:
fs/nfsd/nfsxdr.c:341 nfssvc_decode_writeargs()
warn: no lower bound on 'args->len'
Change the type to unsigned to prevent this issue.
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Removes a call to __set_page_dirty_nobuffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Removes a call to __set_page_dirty_nobuffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Removes several calls to __set_page_dirty_nobuffers(). Also turn the
PageSwapCache() case into a BUG() as there's no way for a swapcache page
to make it to a filesystem that doesn't use SWP_FS_OPS.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Removes a call to __set_page_dirty_nobuffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a trivial change.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This removes a call to __set_page_dirty_nobuffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Optimise the non-DEBUG case to just call filemap_dirty_folio
directly. The DEBUG case doesn't actually compile, but convert
it to dirty_folio anyway.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Convert all users of fscache_set_page_dirty to use fscache_dirty_folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
OrangeFS launders its pages from a number of locations, so add a
small amount of folio usage to its callers where it makes sense.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
We don't need to use page_file_mapping() here because launder_folio
is never called for swap cache pages. We also don't need to
cast an loff_t in order to print it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Straightforward conversion although the helper functions still assume
a single page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Trivial conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a straightfoward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Print the folio index instead of the pointer, since this is more
useful. We also don't need to use page_file_mapping() as we do not
invalidate swapcache pages. Since this is the only caller of
nfs_wb_page_cancel(), convert it to nfs_wb_folio_cancel().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a minimal change which just accepts the new arguments and passes
the single struct page to the functions which do the work. There is
very little progress here toards making f2fs support large folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Extensive changes, but fairly mechanical.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
A straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
A straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Mostly a straightforward conversion. Delete the pointer from the
debugging output as this has no value.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
A lot of the underlying infrastructure in btrfs needs to be switched
over to folios, but this at least documents that invalidatepage can't
be passed a tail page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
We know the page is in the page cache, not the swap cache. If we ever
support folios larger than 2GB, afs_invalidate_dirty() will need to be
fixed, but that's a larger project.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Use folio->index instead of folio_index() because there's no way we're
writing a page from the swapcache to a directory.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a trivial conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
We used to have to use noop_invalidatepage() to prevent
block_invalidatepage() from being called, but that behaviour is now gone.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Remove special-casing of a NULL invalidatepage, since there is no
more block_invalidatepage.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Use iomap_invalidate_folio() in all the iomap-based filesystems
and rename the iomap_invalidatepage tracepoint.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Instead of calling ->invalidatepage directly, use folio_invalidate().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Instead of calling ->invalidatepage directly, use folio_invalidate().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Instead of calling ->invalidatepage directly, use folio_invalidate().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Commit 0bf6276392 ("x32: Warn and disable rather than error if
binutils too old") added a small test in arch/x86/Makefile because
binutils 2.22 or newer is needed to properly support elf32-x86-64. This
check is no longer necessary, as the minimum supported version of
binutils is 2.23, which is enforced at configuration time with
scripts/min-tool-version.sh.
Remove this check and replace all uses of CONFIG_X86_X32 with
CONFIG_X86_X32_ABI, as two symbols are no longer necessary.
[nathan: Rebase, fix up a few places where CONFIG_X86_X32 was still
used, and simplify commit message to satisfy -tip requirements]
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220314194842.3452-2-nathan@kernel.org
Since the uptodate property is maintained on a per-folio basis, the
is_partially_uptodate method should also take a folio. Fix the types
at the same time so it's clear that it returns true/false and takes
the count in bytes, not blocks.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
We have the struct file in generic_remap_file_range_prep() already;
we just need to pass it around instead of the inode.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This tracepoint is defined to take an offset in the file, not an
offset in the folio.
Fixes: 1ac994525b ("iomap: Remove pgoff from tracepoints")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
The symbol xfs_name_dotdot is a global variable that the xfs codebase
uses here and there to look up directory dotdot entries. Currently it's
a non-const variable, which means that it's a mutable global variable.
So far nobody's abused this to cause problems, but let's use the
compiler to enforce that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Various directory functions do not modify their @name parameter,
so mark it const to make that clear. This will enable us to mark
the global xfs_name_dotdot variable as const to prevent mischief.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
XFS does not reserve quota for directory expansion when renaming
children into a directory. This means that we don't reject the
expansion with EDQUOT when we're at or near a hard limit, which means
that unprivileged userspace can use rename() to exceed quota.
Rename operations don't always expand the target directory, and we allow
a rename to proceed with no space reservation if we don't need to add a
block to the target directory to handle the addition. Moreover, the
unlink operation on the source directory generally does not expand the
directory (you'd have to free a block and then cause a btree split) and
it's probably of little consequence to leave the corner case that
renaming a file out of a directory can increase its size.
As with link and unlink, there is a further bug in that we do not
trigger the blockgc workers to try to clear space when we're out of
quota.
Because rename is its own special tricky animal, we'll patch xfs_rename
directly to reserve quota to the rename transaction. We'll leave
cleaning up the rest of xfs_rename for the metadata directory tree
patchset.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
XFS does not reserve quota for directory expansion when linking or
unlinking children from a directory. This means that we don't reject
the expansion with EDQUOT when we're at or near a hard limit, which
means that unprivileged userspace can use link()/unlink() to exceed
quota.
The fix for this is nuanced -- link operations don't always expand the
directory, and we allow a link to proceed with no space reservation if
we don't need to add a block to the directory to handle the addition.
Unlink operations generally do not expand the directory (you'd have to
free a block and then cause a btree split) and we can defer the
directory block freeing if there is no space reservation.
Moreover, there is a further bug in that we do not trigger the blockgc
workers to try to clear space when we're out of quota.
To fix both cases, create a new xfs_trans_alloc_dir function that
allocates the transaction, locks and joins the inodes, and reserves
quota for the directory. If there isn't sufficient space or quota,
we'll switch the caller to reservationless mode. This should prevent
quota usage overruns with the least restriction in functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Combine if tests to reduce the indentation levels of the quota chown
calls in xfs_setattr_nonsize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
revocation isn't consistent with btrfs[1] or ext4. Those two
filesystems use the VFS function setattr_copy to convey certain
attributes from struct iattr into the VFS inode structure.
Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
decide if it should clear setgid and setuid on a file attribute update.
This is a second symptom of the problem that Filipe noticed.
XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
xfs_setattr_nonsize, and xfs_setattr_time. Regrettably, setattr_copy is
/not/ a simple copy function; it contains additional logic to clear the
setgid bit when setting the mode, and XFS' version no longer matches.
The VFS implements its own setuid/setgid stripping logic, which
establishes consistent behavior. It's a tad unfortunate that it's
scattered across notify_change, should_remove_suid, and setattr_copy but
XFS should really follow the Linux VFS. Adapt XFS to use the VFS
functions and get rid of the old functions.
[1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/
Fixes: 7fa294c899 ("userns: Allow chown and setgid preservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
It's counter-intuitive (and wrong) to put the block group _before_ the
final usage in submit_eb_page. Fix it by re-ordering the call to
btrfs_put_block_group after its final reference. Also fix a minor typo
in 'implies'
Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported a possible use-after-free in printing information
in device_list_add.
Very similar with the bug fixed by commit 0697d9a610 ("btrfs: don't
access possibly stale fs_info data for printing duplicate device"),
but this time the use occurs in btrfs_info_in_rcu.
Call Trace:
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
btrfs_printk+0x395/0x425 fs/btrfs/super.c:244
device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957
btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387
btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix this by modifying device->fs_info to NULL too.
Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a previous patch ("btrfs: extend locking to all space_info members
accesses") the locking for the space_info members was extended in
btrfs_preempt_reclaim_metadata_space because not all the member
accesses that needed locks were actually locked (bytes_pinned et al).
It was then suggested to also add a call to lockdep_assert_held to
need_preemptive_reclaim. This function also works with space_info
members. As of now, it has only two call sites which both hold the lock.
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that a bitflip in the transid part of an extent
buffer makes btrfs to reject certain tree blocks:
BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22
[CAUSE]
Note the failed transid check, hex(262166) = 0x40016, while
hex(22) = 0x16.
It's an obvious bitflip.
Furthermore, the reporter also confirmed the bitflip is from the
hardware, so it's a real hardware caused bitflip, and such problem can
not be detected by the existing tree-checker framework.
As tree-checker can only verify the content inside one tree block, while
generation of a tree block can only be verified against its parent.
So such problem remain undetected.
[FIX]
Although tree-checker can not verify it at write-time, we still have a
quick (but not the most accurate) way to catch such obvious corruption.
Function csum_one_extent_buffer() is called before we submit metadata
write.
Thus it means, all the extent buffer passed in should be dirty tree
blocks, and should be newer than last committed transaction.
Using that we can catch the above bitflip.
Although it's not a perfect solution, as if the corrupted generation is
higher than the correct value, we have no way to catch it at all.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@sus,ree.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one oddball error handling of btrfs_read_buffer():
ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key);
if (!ret) {
*eb_ret = tmp;
return 0;
}
free_extent_buffer(tmp);
btrfs_release_path(p);
return -EIO;
While all other call sites check the error first. Unify the behavior.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We had an error handling pattern for read_tree_block() like this:
eb = read_tree_block();
if (IS_ERR(eb)) {
/*
* Handling error here
* Normally ended up with return or goto out.
*/
} else if (!extent_buffer_uptodate(eb)) {
/*
* Different error handling here
* Normally also ended up with return or goto out;
*/
}
This is fine, but if we want to add extra check for each
read_tree_block(), the existing if-else-if is not that expandable and
will take reader some seconds to figure out there is no extra branch.
Here we change it to a more common way, without the extra else:
eb = read_tree_block();
if (IS_ERR(eb)) {
/*
* Handling error here
*/
return eb or goto out;
}
if (!extent_buffer_uptodate(eb)) {
/*
* Different error handling here
*/
return eb or goto out;
}
This also removes some oddball call sites which uses some creative way
to check error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_extent() does all of the hard work of updating the extent
ref items, and then at the end if we dropped the extent completely it
does the cleanup accounting work. We're going to only want to do that
work for metadata with extent tree v2, so extract this bit into its own
helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a remnant of the work I did for qgroups a long time ago to only
run for a block when we had dropped the last ref. We haven't done that
for years, but the code remains. Drop this remnant.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We duplicate this logic for both data and metadata, at this point we've
already done our type specific extent root operations, this is just
doing the accounting and removing the space from the free space tree.
Extract this common logic out into a helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Switch this to an ASSERT() and return the error in the normal case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay there is this pattern of running delayed items after
every inode unlink. To avoid repeating this several times, move the
logic into an helper function and use it instead of calling
btrfs_unlink_inode() followed by btrfs_run_delayed_items().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bytes_pinned is always accessed under space_info->lock, except in
btrfs_preempt_reclaim_metadata_space, however the other members are
accessed under that lock. The reserved member of the rsv's are also
partially accessed under a lock and partially not. Move all these
accesses into the same lock to ensure consistency.
This could potentially race and lead to a flush instead of a commit but
it's not a big problem as it's only for preemptive flush.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a hung_task issue with running generic/068 on an SMR
device. The hang occurs while a process is trying to thaw the
filesystem. The process is trying to take sb->s_umount to thaw the
FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
waiting for an ordered extent to finish. However, as the FS is frozen,
the ordered extents never finish.
Having an ordered extent while the FS is frozen is the root cause of
the hang. The ordered extent is initiated from btrfs_relocate_chunk()
which is called from btrfs_reclaim_bgs_work().
This commit adds sb_*_write() around btrfs_relocate_chunk() call
site. For the usual "btrfs balance" command, we already call it with
mnt_want_file() in btrfs_ioctl_balance().
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.13+
Link: https://github.com/naota/linux/issues/56
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we disallow reflink and dedupe if the two files aren't on the
same vfsmount. However we really only need to disallow it if they're
not on the same super block. It is very common for btrfs to have a main
subvolume that is mounted and then different subvolumes mounted at
different locations. It's allowed to reflink between these volumes, but
the vfsmount check disallows this. Instead fix dedupe to check for the
same superblock, and simply remove the vfsmount check for reflink as it
already does the superblock check.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The sb check is already done in do_clone_file_range, and the mnt check
(which will hopefully go away in a subsequent patch) is done in
ioctl_file_clone(). Remove the check in our code and put an ASSERT() to
make sure it doesn't change underneath us.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need a root here, we just need the btrfs_fs_info, we can just
get the specific roots we need from fs_info.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're passing a root around here, but we only really need the fs_info,
so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead,
and then fix up all the callers appropriately.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a filesystem goes read-only due to an error, multiple errors tend
to be reported, some of which are knock-on failures. Logging fs_states,
in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the
first error from subsequent messages which may only exist due to an
error state.
Under the new format, most initial errors will look like:
`BTRFS: error (device loop0) in ...`
while subsequent errors will begin with:
`error (device loop0: state E) in ...`
An initial transaction abort error will look like
`error (device loop0: state A) in ...`
and subsequent messages will contain
`(device loop0: state EA) in ...`
In addition to the error states we can also print other states that are
temporary, like remounting, device replace, or indicate a global state
that may affect functionality.
Now implemented:
E - filesystem error detected
A - transaction aborted
L - log tree errors
M - remounting in progress
R - device replace in progress
C - data checksums not verified (mounted with ignoredatacsums)
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Smatch complains about a possible dereference of a pointer that was not
initialized:
CC [M] fs/btrfs/reflink.o
CHECK fs/btrfs/reflink.c
fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'.
This is because we are not dealing with the case where the type of a file
extent has an unexpected value (not regular, not prealloc and not inline),
in which case the transaction handle pointer is not initialized.
Such unexpected type should be impossible, except in case of some memory
corruption caused either by bad hardware or some software bug causing
something like a buffer overrun.
So ASSERT that if the extent type is neither regular nor prealloc, then
it must be inline. Bail out with -EUCLEAN and a warning in case it is
not. This silences smatch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reflinking an inline extent, we assert that its file offset is 0 and
that its uncompressed length is not greater than the sector size. We then
return an error if one of those conditions is not satisfied. However we
use a return statement, which results in returning from btrfs_clone()
without freeing the path and buffer that were allocated before, as well as
not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination
inode.
Fix that by jumping to the 'out' label instead, and also add a WARN_ON()
for each condition so that in case assertions are disabled, we get to
known which of the unexpected conditions triggered the error.
Fixes: a61e1e0df9 ("Btrfs: simplify inline extent handling when doing reflinks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When an inode has a last_reflink_trans matching the current transaction,
we have to take special care when logging its checksums in order to
avoid getting checksum items with overlapping ranges in a log tree,
which could result in missing checksums after log replay (more on that
in the changelogs of commit 40e046acbd ("Btrfs: fix missing data
checksums after replaying a log tree") and commit e289f03ea7 ("btrfs:
fix corrupt log due to concurrent fsync of inodes with shared extents")).
We also need to make sure a full fsync will copy all old file extent
items it finds in modified leaves, because they might have been copied
from some other inode.
However once we fsync an inode, we don't need to keep paying the price of
that extra special care in future fsyncs done in the same transaction,
unless the inode is used for another reflink operation or the full sync
flag is set on it (truncate, failure to allocate extent maps for holes,
and other exceptional and infrequent cases).
So after we fsync an inode reset its last_unlink_trans to zero. In case
another reflink happens, we continue to update the last_reflink_trans of
the inode, just as before. Also set last_reflink_trans to the generation
of the last transaction that modified the inode whenever we need to set
the full sync flag on the inode, just like when we need to load an inode
from disk after eviction.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Doing a full fsync may require processing many leaves of metadata, which
can take some time and result in a task monopolizing a cpu for too long.
So add a cond_resched() after processing a leaf when doing a full fsync,
while not holding any locks on any tree (a subvolume or a log tree).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a full fsync, at copy_items(), we iterate over all extents and
then collect their checksums into a list. After copying all the extents to
the log tree, we then log all the previously collected checksums.
Before the previous patch in the series (subject "btrfs: stop copying old
file extents when doing a full fsync"), we had to do it this way, because
while we were iterating over the items in the leaf of the subvolume tree,
we were holding a write lock on a leaf of the log tree, so logging the
checksums for an extent right after we collected them could result in a
deadlock, in case the checksum items ended up in the same leaf.
However after the previous patch in the series we now do a first iteration
over all the items in the leaf of the subvolume tree before locking a path
in the log tree, so we can now log the checksums right after we have
obtained them. This avoids holding in memory all checksums for all extents
in the leaf while copying items from the source leaf to the log tree. The
amount of memory used to hold all checksums of the extents in a leaf can
be significant. For example if a leaf has 200 file extent items referring
to 1M extents, using the default crc32c checksums, would result in using
over 200K of memory (not accounting for the extra overhead of struct
btrfs_ordered_sum), with smaller or less extents it would be less, but
it could be much more with more extents per leaf and/or much larger
extents.
So change copy_items() to log the checksums for an extent after looking
them up, and then free their memory, as they are no longer necessary.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode in full sync mode, we go over every leaf that was
modified in the current transaction and has items associated to our inode,
and then copy all those items into the log tree. This includes copying
file extent items that were created and added to the inode in past
transactions, which is useless and only makes use more leaf space in the
log tree.
It's common to have a file with many file extent items spanning many
leaves where only a few file extent items are new and need to be logged,
and in such case we log all the file extent items we find in the modified
leaves.
So change the full sync behaviour to skip over file extent items that are
not needed. Those are the ones that match the following criteria:
1) Have a generation older than the current transaction and the inode
was not a target of a reflink operation, as that can copy file extent
items from a past generation from some other inode into our inode, so
we have to log them;
2) Start at an offset within i_size - we must log anything at or beyond
i_size, otherwise we would lose prealloc extents after log replay.
The following script exercises a scenario where this happens, and it's
somehow close enough to what happened often on a SQL Server workload which
I had to debug sometime ago to fix an issue where a pattern of writes to
prealloc extents and fsync resulted in fsync failing with -EIO (that was
commit ea7036de0d ("btrfs: fix fsync failure and transaction abort
after writes to prealloc extents")). In that particular case, we had large
files that had random writes and were often truncated, which made the
next fsync be a full sync.
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
MKFS_OPTIONS="-O no-holes -R free-space-tree"
MOUNT_OPTIONS="-o ssd"
FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
# FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
# FILE_SIZE=$((512 * 1024 * 1024)) # 512M
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
# Create a file with many extents. Use direct IO to make it faster
# to create the file - using buffered IO we would have to fsync
# after each write (terribly slow).
echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
# Commit the transaction, so every extent after this is from an
# old generation.
sync
# Now rewrite only a few extents, which are all far spread apart from
# each other (e.g. 1G / 32M = 32 extents).
# After this only a few extents have a new generation, while all other
# ones have an old generation.
echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
done
# Fsync, the inode logged in full sync mode since it was never fsynced
# before.
echo "Fsyncing file..."
xfs_io -c "fsync" $MNT/foobar
umount $MNT
And the following bpftrace program was running when executing the test
script:
$ cat bpf-script.sh
#!/usr/bin/bpftrace
k:btrfs_log_inode
{
@start_log_inode[tid] = nsecs;
}
kr:btrfs_log_inode
/@start_log_inode[tid]/
{
@log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
delete(@start_log_inode[tid]);
}
k:btrfs_sync_log
{
@start_sync_log[tid] = nsecs;
}
kr:btrfs_sync_log
/@start_sync_log[tid]/
{
$sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
printf("btrfs_sync_log() took %llu us\n", $sync_log_dur);
delete(@start_sync_log[tid]);
delete(@log_inode_dur[tid]);
exit();
}
With 512M test file, before this patch:
btrfs_log_inode() took 15218 us
btrfs_sync_log() took 1328 us
Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
With 512M test file, after this patch:
btrfs_log_inode() took 14760 us
btrfs_sync_log() took 588 us
Log tree has a single leaf, its total size is 16K.
With 1G test file, before this patch:
btrfs_log_inode() took 27301 us
btrfs_sync_log() took 1767 us
Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
With 1G test file, after this patch:
btrfs_log_inode() took 26166 us
btrfs_sync_log() took 593 us
Log tree has a single leaf, its total size is 16K
With 2G test file, before this patch:
btrfs_log_inode() took 50892 us
btrfs_sync_log() took 3127 us
Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
With 2G test file, after this patch:
btrfs_log_inode() took 50126 us
btrfs_sync_log() took 586 us
Log tree has a single leaf, its total size is 16K.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The submit helper will always run bio_endio() on the bio if it fails to
submit, so cleaning up the bio just leads to a variety of use-after-free
and NULL pointer dereference bugs because we race with the endio
function that is cleaning up the bio. Instead just return BLK_STS_OK as
the repair function has to continue to process the rest of the pages,
and the endio for the repair bio will do the appropriate cleanup for the
page that it was given.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to submit a bio for whatever reason, we may not have setup a
mirror_num for that bio. This means we shouldn't try to do the repair
workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in
clean_io_failure. Instead simply skip the repair workflow if we have no
mirror set, and add an assert to btrfs_check_repairable() to make it
easier to catch what is happening in the future.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit some weird panics while fixing up the error handling from
btrfs_lookup_bio_sums(). Turns out the compression path will complete
the bio we use if we set up any of the compression bios and then return
an error, and then btrfs_submit_data_bio() will also call bio_endio() on
the bio.
Fix this by making btrfs_submit_compressed_read() responsible for
calling bio_endio() on the bio if there are any errors. Currently it
was only doing it if we created the compression bios, otherwise it was
depending on btrfs_submit_data_bio() to do the right thing. This
creates the above problem, so fix up btrfs_submit_compressed_read() to
always call bio_endio() in case of an error, and then simply return from
btrfs_submit_data_bio() if we had to call
btrfs_submit_compressed_read().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we just have a binary "errors" flag, so any error we get on
the compressed bio's gets translated to EIO. This isn't necessarily a
bad thing, but if we get an ENOMEM it may be nice to know that's what
happened instead of an EIO. Track our errors as a blk_status_t, and do
the appropriate setting of the errors accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This bio is usually one of the compressed bio's, and we don't actually
need it in this function, so remove the argument and stop passing it
around.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit c09abff87f ("btrfs: cloned bios must not be iterated by
bio_for_each_segment_all") added ASSERT()'s to make sure we weren't
calling bio_for_each_segment_all() on a RAID5/6 bio. However it was
checking the bio that the compression code passed in, not the
cb->orig_bio that we actually iterate over, so adjust this ASSERT() to
check the correct bio.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently any error we get while trying to lookup csums during reads
shows up as a missing csum, and then on the read completion side we
print an error saying there was a csum mismatch and we increase the
device corruption count.
However we could have gotten an EIO from the lookup. We could also be
inside of a memory constrained container and gotten a ENOMEM while
trying to do the read. In either case we don't want to make this look
like a file system corruption problem, we want to make it look like the
actual error it is. Capture any negative value, convert it to the
appropriate blk_status_t, free the csum array if we have one and bail.
Note: a possible improvement would be to make the relocation code look
up the owning inode and see if it's marked as NODATASUM and set
EXTENT_NODATASUM there, that way if there's corruption and there isn't a
checksum when we want it we can fail here rather than later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can either fail to find a csum entry at all and return -ENOENT, or we
can find a range that is close, but return -EFBIG. In essence these
both mean the same thing when we are doing a lookup for a csum in an
existing range, we didn't find a csum. We want to treat both of these
errors the same way, complain loudly that there wasn't a csum. This
currently happens anyway because we do
count = search_csum_tree();
if (count <= 0) {
// reloc and error handling
}
However it forces us to incorrectly treat EIO or ENOMEM errors as on
disk corruption. Fix this by returning 0 if we get either -ENOENT or
-EFBIG from btrfs_lookup_csum() so we can do proper error handling.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)
we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:
btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
The start parameter to cow_file_range_inline() (and
insert_inline_extent()) is always 0, so get rid of it and simplify the
logic in those two functions. Pass btrfs_inode to insert_inline_extent()
and remove the redundant root parameter. Also document the requirements
for creating an inline extent. No functional change.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
the extent on disk, which may be less than or greater than (for
bookends) the size in the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we only create ordered extents when ram_bytes == num_bytes
and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create
extents which only refer to a subset of the full unencoded extent, so we
need to plumb these fields through the ordered extent infrastructure and
pass them down to insert_reserved_file_extent().
Since we're changing the btrfs_add_ordered_extent* signature, let's get
rid of the trivial wrappers and add a kernel-doc.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because:
1) the compressed bio always covers one ordered extent
2) the size of the bio is always less than the size of the ordered
extent
However, the second point will not always be true for encoded writes.
Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Encoded I/O in Btrfs needs to check a write with a given logical size
without an iov_iter that matches that size (because the iov_iter we have
is for the compressed data). So, factor out the parts of
generic_write_check() that don't need an iov_iter into a new
generic_write_checks_count() function and export that.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I'm adding btrfs ioctls to read and write compressed data, and rather
than duplicating the checks in rw_verify_area(), let's just export it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These comments are old, outdated and not very specific. It seems that it
doesn't help to inspire anybody to work on that. So we remove them.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Removes duplicated check when adding qgroup relations.
btrfs_add_qgroup_relations function adds relations by calling
add_relation_rb(). add_relation_rb() checks that member/parentid exists
in current qgroup_tree. But it already checked before calling the
function. It seems that we don't need to double check.
Add new function __add_relation_rb() that adds relations with
qgroup structures and makes old function use the new one. And it makes
btrfs_add_qgroup_relation() function work without double checks by
calling the new function.
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
It makes it more readable for length checking and is be used repeatedly.
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_get_extent() tries to get some file extent from disk, it
never populates extent_map::generation, leaving the value to be 0.
On the other hand, for extent map generated by IO, it will get its
generation properly set at finish_ordered_io()
finish_ordered_io()
|- unpin_extent_cache(gen = trans->transid)
|- em->generation = gen;
[CAUSE]
Since extent_map::generation is mostly used by fsync code, and for fsync
they only care about modified extents, which all have their
em::generation > 0.
Thus it's fine to not populate em read from disk for fsync.
[CORNER CASE]
However autodefrag also relies on em::generation to determine if one
extent needs to be defragged.
This unpopulated extent_map::generation can prevent the following
autodefrag case from working:
mkfs.btrfs -f $dev
mount $dev $mnt -o autodefrag
# initial write to queue the inode for autodefrag
xfs_io -f -c "pwrite 0 4k" $mnt/file
sync
# Real fragmented write
xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file
sync
echo "=== before autodefrag ==="
xfs_io -c "fiemap -v" $mnt/file
# Drop cache to force em to be read from disk
echo 3 > /proc/sys/vm/drop_caches
mount -o remount,commit=1 $mnt
sleep 3
sync
echo "=== After autodefrag ==="
xfs_io -c "fiemap -v" $mnt/file
umount $mnt
The result looks like this:
=== before autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
=== After autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
This fragmented 32K will not be defragged by autodefrag.
[FIX]
To make things less weird, just populate extent_map::generation when
reading file extents from disk.
This would make above fragmented extents to be properly defragged:
== before autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
=== After autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..63]: 26688..26751 64 0x1
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Removing or replacing an extent map requires holding a write lock on the
extent map's tree. We currently do that everywhere, except in one of the
self tests, where it's harmless since there's no concurrency.
In order to catch possible races in the future, assert that we are holding
a write lock on the extent map tree before removing or replacing an extent
map in the tree, and update the self test to obtain a write lock before
removing extent maps.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit 92082d4097 ("btrfs: integrate page status update for
data read path into begin/end_page_read"), the 'nr' counter at
btrfs_do_readpage() is no longer used, we increment it but we never
read from it. So just remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_do_readpage(), if we get an error when trying to lookup for an
extent map, we end up marking the page with the error bit, clearing
the uptodate bit on it, and doing everything else that should be done.
However we return success (0) to the caller, when we should return the
error encoded in the extent map pointer. So fix that by returning the
error encoded in the pointer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At extent_io.c, in the read page and write page code paths, we are testing
if the return value from btrfs_get_extent() can be NULL. However that is
not possible, as btrfs_get_extent() always returns either an error pointer
or a (non-NULL) pointer to an extent map structure.
Everywhere else outside extent_io.c we never check for NULL, we always
treat any returned value as a non-NULL pointer if it does not encode an
error.
So check only for the IS_ERR() case at extent_io.c.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we want to log an extent, in the fast fsync path, we obtain a path
to the leaf that will hold the file extent item either through a deletion
search, via btrfs_drop_extents(), or through an insertion search using
btrfs_insert_empty_item(). After that we fill the file extent item's
fields one by one directly on the leaf.
Instead of doing that, we could prepare the file extent item before
obtaining a btree path, and then copy the prepared extent item with a
single operation once we get the path. This helps avoid some contention
on the log tree, since we are holding write locks for longer than
necessary, especially in the case where the path is obtained via
btrfs_drop_extents() through a deletion search, which always keeps a
write lock on the nodes at levels 1 and 2 (besides the leaf).
This change does that, we prepare the file extent item that is going to
be inserted before acquiring a path, and then copy it into a leaf using
a single copy operation once we get a path.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The following test was run to measure the impact of the whole patchset:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=8
FILE_SIZE=128M
RUN_TIME=200
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
ioengine=sync
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test ran inside a VM (8 cores, 32G of RAM) with the target disk
mapping to a raw NVMe device, and using a non-debug kernel config
(Debian's default config).
Before the patchset:
WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
After the patchset:
WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
A 7.8% gain on throughput and +7.0% more IO done in the same period of
time (200 seconds).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in calling btrfs_release_path() after finishing the loop
that logs the modified extents, since log_one_extent() returns with the
path released. In case the list of extents is empty, the path is already
released, so there's no need for that case as well.
So just remove that unnecessary btrfs_release_path() call.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_drop_extents(), we try to replace a range of file extent items
with a new file extent in a single btree search, to avoid the need to do
a search for deletion, followed by a path release and followed by yet
another search for insertion.
When I originally added that optimization, in commit 1acae57b16
("Btrfs: faster file extent item replace operations"), I left a constraint
to do the fast replace only if we visited a single leaf. That was because
in the most common case we find all file extent items that need to be
deleted (or trimmed) in a single leaf, however it can work for other
common cases like when we need to delete a few file extent items located
at the end of a leaf and a few more located at the beginning of the next
leaf. The key for the new file extent item is greater than the key of
any deleted or trimmed file extent item from previous leaves, so we are
fine to use the last leaf that we found as long as we are holding a
write lock on it - even if the new key ends up at slot 0, as if that's
the case, the btree search has obtained a write lock on any upper nodes
that need to have a key pointer updated.
So removed the constraint that limits the optimization to the case where
we visited only a single leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deleting items from a leaf, we always compute the sum of the data
sizes of the items that are going to be deleted. However we only use
that sum when the last item to delete is behind the last item in the
leaf. This unnecessarily wastes CPU time when we are deleting either
the whole leaf or from some slot > 0 up to the last item in the leaf,
and both of these cases are common (e.g. truncation operation, either
as a result of truncate(2) or when logging inodes, deleting checksums
after removing a large enough extent, etc).
So compute only the sum of the data sizes if the last item to be
deleted does not match the last item in the leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we delete items from a leaf, if we end up with more than two thirds
of unused leaf space, we try to delete the leaf by moving all its items
into its left and right neighbour leaves. Sometimes that is not possible
because there is not enough free space in the left and right leaves, and
in that case we end up not deleting our leaf.
The way we are doing this is not ideal and can be improved in the
following ways:
1) When we call push_leaf_left(), we pass a value of 1 byte to the data
size parameter of push_leaf_left(). This is not realistic value because
no item can have a size less than 25 bytes, which is the size of struct
btrfs_item. This means that means that if the left leaf has not enough
free space to push any item, we end up COWing it even if we end up not
changing its content at all.
COWing that leaf means allocating a new metadata extent, marking it
dirty and doing more IO when committing a transaction or when syncing a
log tree. For a log tree case, it's particularly more important to
avoid the useless COW operation, as more IO can imply a higher latency
for an fsync operation.
So instead of passing 1 as the minimum data size for push_leaf_left(),
pass the size of the first item in our leaf, as we don't want to COW
the left leaf if we can't at least push the first item of our leaf;
2) When we call push_leaf_right(), we also pass a value of 1 byte as the
data size parameter of push_leaf_right(). Like the previous case, it
will also result in COWing the right leaf even if we are not able to
move any items into it, since there can't be any item with a size
smaller than 25 bytes (the size of struct btrfs_item).
So instead of passing 1 as the minimum data size to push_leaf_right(),
pass a size that corresponds to the sum of the size of all the
remaining items in our leaf. We are not interested in moving less than
that, because if we do, we are not able to delete our leaf and we have
COWed the right leaf for nothing. Plus, moving only some of the items
of our leaf, it means an even less balanced tree.
Just like the previous case, we want to avoid the useless COW of the
right leaf, this way we don't have to spend time allocating one new
metadata extent, and doing more IO when committing a transaction or
syncing a log tree. For the log tree case it's specially more important
because more IO can result in a higher latency for a fsync operation.
So adjust the minimum data size passed to push_leaf_left() and
push_leaf_right() as mentioned above.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
Not being able to delete a leaf that became less than 1/3 full after
deleting items from it is actually common. For example, for the fio test
mentioned in the changelog of patch 6/6, we are only able to delete a
leaf at btrfs_del_items() about 5.3% of the time, due to its left and
right neighbour leaves not having enough free space to push all the
remaining items into them.
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trying to push items from a leaf into its left and right neighbours,
we lock the left or right leaf, check if it has the required minimum free
space, COW the leaf and then check again if it has the minimum required
free space. This second check is pointless:
1) Most and foremost because it's not needed. We have a write lock on the
leaf and on its parent node, so no one can come in and change either
the pre-COW or post-COW version of the leaf for the whole duration of
the push_leaf_left() and push_leaf_right() calls;
2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
amount of arithmetic operations and access to fields in the leaf's
header and items, so it's not very cheap.
So remove the duplicated free space checks.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In get_extent_skip_holes() we're checking the return of
btrfs_get_extent_fiemap() for an error pointer or NULL, but
btrfs_get_extent_fiemap() will never return NULL, only error pointers or
a valid extent_map.
The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
correctly only checks for error-pointers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the redundant assignment to zone_info variable in
btrfs_check_zoned_mode function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The static_assert introduced in 6bab69c650 ("build_bug.h: add wrapper
for _Static_assert") has been supported by compilers for a long time
(gcc 4.6, clang 3.0) and can be used in header files. We don't need to
put BUILD_BUG_ON to random functions but rather keep it next to the
definition.
The exception here is the UAPI header btrfs_tree.h that could be
potentially included by userspace code and the static assert is not
defined (nor used in any other header).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allow creating or reading block-groups on a zoned device with DUP as a
meta-data profile.
This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
operations for meta-data on zoned btrfs, so all writes to meta-data zones
are aligned to the zone's write-pointer.
Upon loading of the block-group, it is ensured both zones do have the same
zone capacity and write-pointer offsets, so no extra machinery is needed
to keep the write-pointers in sync for the meta-data zones. If this
prerequisite is not met, loading of the block-group is refused.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allow for a block-group to be placed on more than one physical zone.
This is a preparation for allowing DUP profiles for meta-data on a zoned
file-system.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently finishing of a zone only works if the block group isn't
spanning more than one zone.
This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.
This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently activation of a zone only works if the block group isn't
spanning more than one zone.
This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.
This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 you will be able to create multiple csum, extent,
and free space trees. They will be used based on the block group, which
will now use the block_group_item->chunk_objectid to point to the set of
global roots that it will use. When allocating new block groups we'll
simply mod the gigabyte offset of the block group against the number of
global roots we have and that will be the block groups global id.
>From there we can take the bytenr that we're modifying in the respective
tree, look up the block group and get that block groups corresponding
global root id. From there we can get to the appropriate global root
for that bytenr.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This code adds the on disk structures for the block group root, which
will hold the block group items for extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to be adding more roots that need to be loaded from the
super block, so abstract out the code to read the tree_root from the
super block, and use this helper for the chunk root as well. This will
make it simpler to load the new trees in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For extent tree v2 we can definitely have empty extent roots, so skip
this particular check if we have that set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We cannot fall back on the slow caching for extent tree v2, which means
we can't just arbitrarily clear the free space trees at mount time.
Furthermore we can't do v1 space cache with extent tree v2. Simply
ignore these mount options for extent tree v2 as they aren't relevant.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we stop tracking metadata blocks all of snapshotting will break, so
disable it until I add the snapshot root and drop tree support.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Scrub depends on extent references for every block, and with extent tree
v2 we won't have that, so disable scrub until we can add back the proper
code to handle extent-tree-v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backref lookups are going to be drastically different with extent tree
v2, disable qgroups until we do the work to add this support for extent
tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Device add, remove, and replace all require balance, which doesn't work
right now on extent tree v2, so disable these for now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With global root id's it makes it problematic to do backref lookups for
balance. This isn't hard to deal with, but future changes are going to
make it impossible to lookup backrefs on any COWonly roots, so go ahead
and disable balance for now on extent tree v2 until we can add balance
support back in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the initial definition of the EXTENT_TREE_V2 incompat feature
flag. This also hides the support behind CONFIG_BTRFS_DEBUG.
THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
DEVELOPER OR A TESTER.
The format is in flux and will be added in stages, any fs will need to
be re-made between updates to the format.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_log_inode(), we have two variables to track errors and the
return value of the function, named 'ret' and 'err'. In some places we
use 'ret' and if gets a non-zero value we assign its value to 'err'
and then jump to the 'out' label, while in other places we use 'err'
directly without 'ret' as an intermediary. This is inconsistent, error
prone and not necessary. So change that to use only the 'ret' variable,
making this consistent with most functions in btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a rename or link operation, we need to determine if an inode was
previously logged or not, and if it was, do some update to the logged
inode. We used to rely exclusively on the logged_trans field of struct
btrfs_inode to determine that, but that was not reliable because the
value of that field is not persisted in the inode item, so it's lost
when an inode is evicted and loaded back again. That led to several
issues in the past, such as not persisting deletions (such as the case
fixed by commit 803f0f64d1 ("Btrfs: fix fsync not persisting dentry
deletions due to inode evictions")), or resulting in losing a file
after an inode eviction followed by a rename (commit ecc64fab7d
("btrfs: fix lost inode on log replay after mix of fsync, rename and
inode eviction")), besides other issues.
So the inode_logged() helper was introduced and used to determine if an
inode was possibly logged before in the current transaction, with the
caveat that it could return false positives, in the sense that even if an
inode was not logged before in the current transaction, it could still
return true, but never to return false in case the inode was logged.
>From a functional point of view that is fine, but from a performance
perspective it can introduce significant latencies to rename and link
operations, as they will end up doing inode logging even when it is not
necessary.
Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
installations and upgrades, with the zypper tool, were often taking a
long time to complete. With strace it could be observed that zypper was
spending about 99% of its time on rename operations, and then with
further analysis we checked that directory logging was happening too
frequently. Taking into account that installation/upgrade of some of the
packages needed a few thousand file renames, the slowdown was very
noticeable for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
a 5.16-rc8 kernel. While triggering the inode evictions if something
outside btrfs' control, btrfs could still behave better by eliminating
the false positives from the inode_logged() helper.
So change inode_logged() to actually eliminate such false positives caused
by inode eviction and when an inode was never logged since the filesystem
was mounted, as both cases relate to when the logged_trans field of struct
btrfs_inode has a value of zero. When it can not determine if the inode
was logged based only on the logged_trans value, lookup for the existence
of the inode item in the log tree - if it's there then we known the inode
was logged, if it's not there then it can not have been logged in the
current transaction. Once we determine if the inode was logged, update
the logged_trans value to avoid future calls to have to search in the log
tree again.
Alternatively, we could start storing logged_trans in the on disk inode
item structure (struct btrfs_inode_item) in the unused space it still has,
but that would be a bit odd because:
1) We only care about logged_trans since the filesystem was mounted, we
don't care about its value from a previous mount. Having it persisted
in the inode item structure would not make the best use of the precious
unused space;
2) In order to get logged_trans persisted before inode eviction, we would
have to update the delayed inode when we finish logging the inode and
update its logged_trans in struct btrfs_inode, which makes it a bit
cumbersome since we need to check if the delayed inode exists, if not
create it and populate it and deal with any errors (-ENOMEM mostly).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
The following test script mimics part of what the zypper tool does during
package installations/upgrades. It does not triggers inode evictions, but
it's similar because it triggers false positives from the inode_logged()
helper, because the inodes have a logged_trans of 0, there's a log tree
due to a fsync of an unrelated file and the directory inode has its
last_trans field set to the current transaction:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Now do some change to an unrelated file and fsync it.
# This is just to create a log tree to make sure that inode_logged()
# does not return false when called against "testdir".
xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
# Do some change to testdir. This is to make sure inode_logged()
# will return true when called against "testdir", because its
# logged_trans is 0, it was changed in the current transaction
# and there's a log tree.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on a box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27837 ms
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9236 ms (-66.8%)
NUM_FILES=10000, after whole patchset applied: 8902 ms (-68.0%)
NUM_FILES=5000, before patchset: 9127 ms
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4640 ms (-49.2%)
NUM_FILES=5000, after whole patchset applied: 4441 ms (-51.3%)
NUM_FILES=2000, before patchset: 2528 ms
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1983 ms (-21.6%)
NUM_FILES=2000, after whole patchset applied: 1747 ms (-30.9%)
NUM_FILES=1000, before patchset: 1085 ms
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 893 ms (-17.7%)
NUM_FILES=1000, after whole patchset applied: 867 ms (-20.1%)
Running dbench on the same physical machine with the following script:
$ cat run-dbench.sh
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 3761352 0.032 143.843
Close 2762770 0.002 2.273
Rename 159304 0.291 67.037
Unlink 759784 0.207 143.998
Deltree 72 4.028 15.977
Mkdir 36 0.003 0.006
Qpathinfo 3409780 0.013 9.678
Qfileinfo 596772 0.001 0.878
Qfsinfo 625189 0.003 1.245
Sfileinfo 306443 0.006 1.840
Find 1318106 0.063 19.798
WriteX 1871137 0.021 8.532
ReadX 5897325 0.003 3.567
LockX 12252 0.003 0.258
UnlockX 12252 0.002 0.100
Flush 263666 3.327 155.632
Throughput 980.047 MB/sec 12 clients 12 procs max_latency=155.636 ms
After whole patchset applied:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4195584 0.033 107.742
Close 3081932 0.002 1.935
Rename 177641 0.218 14.905
Unlink 847333 0.166 107.822
Deltree 118 5.315 15.247
Mkdir 59 0.004 0.048
Qpathinfo 3802612 0.014 10.302
Qfileinfo 666748 0.001 1.034
Qfsinfo 697329 0.003 0.944
Sfileinfo 341712 0.006 2.099
Find 1470365 0.065 9.359
WriteX 2093921 0.021 8.087
ReadX 6576234 0.003 3.407
LockX 13660 0.003 0.308
UnlockX 13660 0.002 0.114
Flush 294090 2.906 115.539
Throughput 1093.11 MB/sec 12 clients 12 procs max_latency=115.544 ms
+11.5% throughput -25.8% max latency rename max latency -77.8%
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a rename, we call __btrfs_unlink_inode(), which will call
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
to remove an inode reference and a directory entry from the log. These
are necessary when __btrfs_unlink_inode() is called from the unlink path,
but not necessary when it's called from a rename context, because:
1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
inode reference related to the old name, because later in the rename
path we call btrfs_log_new_name(), which will drop all inode references
from the log and copy all inode references from the subvolume tree to
the log tree. So we are doing one unnecessary btree operation which
adds additional latency and lock contention in case there are other
tasks accessing the log tree;
2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
equivalent at btrfs_log_new_name() since the previous patch in the
series, that has the subject "btrfs: avoid logging all directory
changes during renames". In fact, having __btrfs_unlink_inode() call
this function not only adds additional latency and lock contention due
to the extra btree operation, but also can make btrfs_log_new_name()
unnecessarily log a range item to track the deletion of the old name,
since it has no way to known that the directory entry related to the
old name was previously logged and already deleted by
__btrfs_unlink_inode() through its call to
btrfs_del_dir_entries_in_log().
So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
Skipping them also allows us now to reduce the duration of time we are
pinning a log transaction during renames, which is always beneficial as
it's not delaying so much other tasks trying to sync the log tree, in
particular we end up not holding the log transaction pinned while adding
the new name (adding inode ref, directory entry, etc).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
Just like the previous patch in the series, "btrfs: avoid logging all
directory changes during renames", the following script mimics part of
what a package installation/upgrade with zypper does, which is basically
renaming a lot of files, in some directory under /usr, to a name with a
suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on box a using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27399 ms
NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%)
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%)
NUM_FILES=5000, before patchset: 9241 ms
NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%)
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%)
NUM_FILES=2000, before patchset: 2550 ms
NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%)
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%)
NUM_FILES=1000, before patchset: 1088 ms
NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%)
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%)
The next patch in the series (5/5), also contains dbench results after
applying to whole patchset.
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a rename of a file, if the file or its old parent directory
were logged before, we log the new name of the file and then make sure
we log the old parent directory, to ensure that after a log replay the
old name of the file is deleted and the new name added.
The logging of the old parent directory can take some time, because it
will scan all leaves modified in the current transaction, check which
directory entries were already logged, copy the ones that were not
logged before, etc. In this rename context all we need to do is make
sure that the old name of the file is deleted on log replay, so instead
of triggering a directory log operation, we can just delete the old
directory entry from the log if it's there, or in case it isn't there,
just log a range item to signal log replay that the old name must be
deleted. So change btrfs_log_new_name() to do that.
This scenario is actually not uncommon to trigger, and recently on a
5.15 kernel, an openSUSE Tumbleweed user reported package installations
and upgrades, with the zypper tool, were often taking a long time to
complete, much more than usual. With strace it could be observed that
zypper was spending over 99% of its time on rename operations, and then
with further analysis we checked that directory logging was happening
too frequently and causing high latencies for the rename operations.
Taking into account that installation/upgrade of some of these packages
needed about a few thousand file renames, the slowdown was very noticeable
for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
in an efficient way, if an inode was previously logged in the current
transaction, so we are pessimistic and assume it was, because in case
it was we need to update the logged inode. More details on that in one
of the patches in the same series (subject "btrfs: avoid inode logging
during rename and link when possible"). Either way, in case the parent
directory was logged before, we currently do more work then necessary
during a rename, and this change minimizes that amount of work.
The following script mimics part of what a package installation/upgrade
with zypper does, which is basically renaming a lot of files, in some
directory under /usr, to a name with a suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before this patch: 27399 ms
NUM_FILES=10000, after this patch: 9093 ms (-66.8%)
NUM_FILES=5000, before this patch: 9241 ms
NUM_FILES=5000, after this patch: 4642 ms (-49.8%)
NUM_FILES=2000, before this patch: 2550 ms
NUM_FILES=2000, after this patch: 1788 ms (-29.9%)
NUM_FILES=1000, before this patch: 1088 ms
NUM_FILES=1000, after this patch: 905 ms (-16.9%)
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the next patch in the series, there will be the need to access the old
name, and its length, of an inode when logging the inode during a rename.
So instead of passing the inode to btrfs_log_new_name() pass the dentry,
because from the dentry we can get the inode, the name and its length.
This will avoid passing 3 new parameters to btrfs_log_new_name() in the
next patch - the name, its length and an index number. This way we end
up passing only 1 new parameter, the index number.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the code that finds and deletes a logged dir entry out of
btrfs_del_dir_entries_in_log() into a helper function. This new helper
function will be used by another patch in the same series, and serves
to avoid having duplicated logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Return value from fs_path_add_path() directly instead of taking this in
another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of having 2 places that short circuit the qgroup leaf scan have
everything in the qgroup_rescan_leaf function. In addition to that, also
ensure that the inconsistent qgroup flag is set when rescan_should_stop
returns true. This both retains the old behavior when -EINTR was set in
the body of the loop and at the same time also extends this behavior
when scanning is interrupted due to remount or unmount operations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
increment is being initialized to map->stripe_len but this is never
read as increment is overwritten later on. Remove the redundant
initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/scrub.c:3193:6: warning: Value stored to 'increment' during its
initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
to_add is being initialized to len but this is never read as to_add is
overwritten later on. Remove the redundant initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/extent-tree.c:2769:8: warning: Value stored to 'to_add' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pointer to struct request_queue is used only to get device type
rotating or the non-rotating. So use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit "btrfs: add device major-minor info in the struct btrfs_device"
saved the device major-minor number in the struct btrfs_device upon
discovering it.
So no need to lookup_bdev() again just match, which means
device_matched() can go away.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Internally it is common to use the major-minor number to identify a
device and, at a few locations in btrfs, we use the major-minor number
to match the device.
So when we identify a new btrfs device through device add or device
replace or device-scan/ready save the device's major-minor (dev_t) in the
struct btrfs_device so that we don't have to call lookup_bdev() again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the commit "btrfs: harden identification of the stale device", we
don't have to match the device path anymore. Instead, we match the dev_t.
So pass in the dev_t instead of the device path, in the call chain
btrfs_forget_devices()->btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Identifying and removing the stale device from the fs_uuids list is done
by btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn
depends on device_path_matched() to check if the device appears in more
than one btrfs_device structure.
The matching of the device happens by its path, the device path. However,
when device mapper is in use, the dm device paths are nothing but a link
to the actual block device, which leads to the device_path_matched()
failing to match.
Fix this by matching the dev_t as provided by lookup_bdev() instead of
plain string compare of the device paths.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_init_dev_replace_tgtdev() we dereference fs_info to get
fs_devices many times, instead save a point to the fs_devices.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl extracts inode from file so we can pass that into the
callbacks.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This simplifies the code flow in read_one_chunk and makes error handling
when handling missing devices a bit simpler by reducing it to a single
check if something went wrong. No functional changes.
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit 2f2ff0ee5e ("Btrfs:
fix metadata inconsistencies after directory fsync"), to fix some metadata
inconsistencies that in the meanwhile no longer need this behaviour due to
numerous other changes that happened throughout the years.
This behaviour, besides not needed anymore, it's also undesirable because:
1) It's not reliable because it's only triggered for the directories
of dentries (dir items) that happen to be present on a leaf that
was changed in the current transaction. If a dentry that points to
a directory resides on a leaf that was not changed in the current
transaction, then it's not logged, as at log_dir_items() and
log_new_dir_dentries() we use btrfs_search_forward();
2) It's not required by posix or any standard, it's undefined territory.
The only way to guarantee a subdirectory is logged, it to explicitly
fsync it;
Making the behaviour guaranteed would require scanning all directory
items, check which point to a directory, and then fsync each subdirectory
which was modified in the current transaction. This could be very
expensive for large directories with many subdirectories and/or large
subdirectories.
So remove that obsolete logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, we go over every leaf of the subvolume tree that
was changed in the current transaction and copy all its dir index keys to
the log tree.
That includes copying dir index keys created in past transactions. This is
done mostly for simplicity, as after logging the keys we log an item that
specifies the start and end ranges of the keys we logged. That item is
then used during log replay to figure out which keys need to be deleted -
every key in that range that we find in the subvolume tree and is not in
the log tree, needs to be deleted.
Now that we log only dir index keys, and not dir item keys anymore, when
we remove dentries from a directory (due to unlink and rename operations),
we can get entire leaves that we changed only for deleting old dir index
keys, or that have few dir index keys that are new - this is due to the
fact that the offset for new index keys comes from a monotonically
increasing counter.
We can avoid logging dir index keys from past transactions, and in order
to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
type) when we find gaps between consecutive index keys. This massively
reduces the amount of logged metadata when we have deleted directory
entries, even if it's a small percentage of the total number of entries.
The reduction comes from both less items that are logged and instead of
logging many dir index items (struct btrfs_dir_item), which have a size
of 30 bytes plus a file name, we typically log just a few range items
(struct btrfs_dir_log_item), which take only 8 bytes each.
Even if no entries were deleted from a directory and only new entries
were added, we typically still get a reduction on the amount of logged
metadata, because it's very likely the first leaf that got the new
dir index entries also has several old dir index entries.
So change the logging logic to not log dir index keys created in past
transactions and log a range item for every gap it finds between each
pair of consecutive index keys, to ensure deletions are tracked and
replayed on log replay.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
The following test was run on a branch without this patchset and on a
branch with the first three patches applied:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=1000000
NUM_FILE_DELETES=10000
MKFS_OPTIONS="-O no-holes -R free-space-tree"
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config),
and the results were the following for various values of NUM_FILES and
NUM_FILE_DELETES:
** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 585 ms after deleting 10000 files
** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 34 ms after deleting 10000 files (-94.2%)
** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 50 ms after deleting 1000 files
** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 7 ms after deleting 1000 files (-86.0%)
** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 9 ms after deleting 100 files
** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 5 ms after deleting 100 files (-44.4%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_set_inode_index_count() we refer twice to the number 2 as the
initial index value for a directory (when it's empty), with a proper
comment explaining the reason for that value. In the next patch I'll
have to use that magic value in the directory logging code, so put
the value in a #define at btrfs_inode.h, to avoid hardcoding the
magic value again at tree-log.c.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before we start to log dir index keys from a leaf, we check if there is a
previous index key, which normally is at the end of a leaf that was not
changed in the current transaction. Then we log that key and set the start
of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
that key. This is to ensure that if there were deleted index keys between
that key and the first key we are going to log, those deletions are
replayed in case we need to replay to the log after a power failure.
However we really don't need to log that previous key, we can just set the
start of the logged range to that key's offset plus 1. This achieves the
same and avoids logging one dir index key.
The same logic is performed when we finish logging the index keys of a
leaf and we find that the next leaf has index keys and was not changed in
the current transaction. We are logging the first key of that next leaf
and use its offset as the end of range we log. This is just to ensure that
if there were deleted index keys between the last index key we logged and
the first key of that next leaf, those index keys are deleted if we end
up replaying the log. However that is not necessary, we can avoid logging
that first index key of the next leaf and instead set the end of the
logged range to match the offset of that index key minus 1.
So avoid logging those index keys at the boundaries and adjust the start
and end offsets of the logged ranges as described above.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
Performance test results are listed in the changelog of patch 3/4.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl already contains pointers to the inode and btrfs_root
structs, so we can pass them into the subfunctions instead of the
toplevel struct file.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ->write and ->wait fields of struct walk_control, used for log trees,
are not used since 2008, more specifically since commit d0c803c404
("Btrfs: Record dirty pages tree-log pages in an extent_io tree") and
since commit d0c803c404 ("Btrfs: Record dirty pages tree-log pages in
an extent_io tree"). So just remove them, along with the function
btrfs_write_tree_block(), which is also not used anymore after removing
the ->write member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit handling code is not safe against memory-pressure deadlocks
when writing to swap. In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.
swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does. However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.
For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_STABLE.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
possible deadlocks with "fs_reclaim". These deadlocks could, I believe,
eventuate if a buffered read on the swapfile was attempted.
We don't need coherence with the page cache for a swap file, and
buffered writes are forbidden anyway. There is no other need for
i_rwsem during direct IO. So never take it for swap_rw()
2/ generic_write_checks() explicitly forbids writes to swap, and
performs checks that are not needed for swap. So bypass it
for swap_rw().
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.
This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC
to be used for some allocations. This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.
Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.
GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.
xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks,
then it blocks all other queued tasks. So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.
Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.
So with this patch:
1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
2/ __rpc_execute() sets PF_MEMALLOC while handling any task
with RPC_TASK_SWAPPER set, or when handling any task that
holds the XPRT_LOCKED lock on an xprt used for swap.
This removes the need for the RPC_IS_SWAPPER() test
in ->buf_alloc handlers.
3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
any task to a swapper xprt. __rpc_execute() will clear it.
3/ PF_MEMALLOC is set for all the connect workers.
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests. This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.
RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.
So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.
Remove both.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This code is pointless as IS_SWAPFILE is always defined.
So remove it.
Suggested-by: Mark Hemment <markhemm@googlemail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The fscache cookie APIs including fscache_acquire_cookie() and
fscache_relinquish_cookie() now have very good tracing. Thus,
there is no real need for dfprintks in the NFS fscache interface.
The NFS fscache interface has removed all dfprintks so remove the
NFSDBG_FSCACHE defines.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Most of fscache and other NFS IO paths are now using tracepoints.
Remove the dfprintks in the NFS fscache read/write page functions
and replace with tracepoints at the begin and end of the functions.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Rename NFS fscache functions in a more consistent fashion
to better reflect when we read from and write to fscache.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
A number of places in the fscache interface used nfs_inode when inode could
be used, simplifying the code.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In the presence of trunking transports, it's helpful to make sure
that during the migration event, the GETATTR for fs_location attribute
happens on the main transport.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[You don't often get email from khoroshilov@ispras.ru. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]
Overflow check in not needed anymore after we switch to kmalloc_array().
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Fixes: a4f743a6bb ("NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
As a preparation for moving to -std=gnu11, turn off the
-Wshift-negative-value option. This warning is enabled by gcc when
building with -Wextra for c99 or higher, but not for c89. Since
the kernel already relies on well-defined overflow behavior,
the warning is not helpful and can simply be disabled in
all locations that use -Wextra.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 (x86-64)
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
This just puts trace_ext4_fc_commit_start(sb) & ktime_get()
for measuring FC commit time, after the check of whether sb
supports JOURNAL_FAST_COMMIT or not.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/d53cf3e535924ec0a1eb41a560e96561b0727e7a.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Below commit removed all references of EXT4_FC_COMMIT_FAILED.
commit 0915e464cb ("ext4: simplify updating of fast commit stats")
Just remove it since it is not used anymore.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/c941357e476be07a1138c7319ca5faab7fb80fc6.1647057583.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Recently I've got a report of BUG_ON trigerring during transaction
commit in ext4_journalled_writepage_callback() because we've spotted a
dirty page without buffers. Add WARN_ON_ONCE to
ext4_journalled_set_page_dirty() to catch the problematic condition
earlier where we have better chance of understanding which code path is
creating dirty data without preparing the page properly. Also update the
comment with current information when we are at it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220310101832.5645-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After moving to the new mount API, mb_optimize_scan mount option
handling was not working as expected due to the parsed value always
being overwritten by default. Refactor and fix this to the expected
behavior described below:
* mb_optimize_scan=1 - On
* mb_optimize_scan=0 - Off
* mb_optimize_scan not passed - On if no. of BGs > threshold else off
* Remounts retain previous value unless we explicitly pass the option
with a new value
Fixes: cebe85d570 ("ext4: switch to the new mount api")
Cc: stable@kernel.org
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c98970fe99f26718586d02e942f293300fb48ef3.1646732698.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Julia and the kernel test robot report that the prep handling for this
command inadvertently checks one field twice:
fs/io_uring.c:4338:42-56: duplicated argument to && or ||
Get rid of it.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 4f57f06ce2 ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc fixes from David Howells:
"A set of patches for watch_queue filter issues noted by Jann. I've
added in a cleanup patch from Christophe Jaillet to convert to using
formal bitmap specifiers for the note allocation bitmap.
Also two filesystem fixes (afs and cachefiles)"
* emailed patches from David Howells <dhowells@redhat.com>:
cachefiles: Fix volume coherency attribute
afs: Fix potential thrashing in afs writeback
watch_queue: Make comment about setting ->defunct more accurate
watch_queue: Fix lack of barrier/sync/lock between post and read
watch_queue: Free the alloc bitmap when the watch_queue is torn down
watch_queue: Fix the alloc bitmap size to reflect notes allocated
watch_queue: Use the bitmap API when applicable
watch_queue: Fix to always request a pow-of-2 pipe ring size
watch_queue: Fix to release page in ->release()
watch_queue, pipe: Free watchqueue state after clearing pipe ring
watch_queue: Fix filter limit check
A network filesystem may set coherency data on a volume cookie, and if
given, cachefiles will store this in an xattr on the directory in the
cache corresponding to the volume.
The function that sets the xattr just stores the contents of the volume
coherency buffer directly into the xattr, with nothing added; the
checking function, on the other hand, has a cut'n'paste error whereby it
tries to interpret the xattr contents as would be the xattr on an
ordinary file (using the cachefiles_xattr struct). This results in a
failure to match the coherency data because the buffer ends up being
shifted by 18 bytes.
Fix this by defining a structure specifically for the volume xattr and
making both the setting and checking functions use it.
Since the volume coherency doesn't work if used, take the opportunity to
insert a reserved field for future use, set it to 0 and check that it is
0. Log mismatch through the appropriate tracepoint.
Note that this only affects cifs; 9p, afs, ceph and nfs don't use the
volume coherency data at the moment.
Fixes: 32e150037d ("fscache, cachefiles: Store the volume coherency data")
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <smfrench@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In afs_writepages_region(), if the dirty page we find is undergoing
writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go
round the loop trying the same page again and again with no pausing or
waiting unless and until another thread manages to clear the writeback
and fscache flags.
Fix this with three measures:
(1) Advance start to after the page we found.
(2) Break out of the loop and return if rescheduling is requested.
(3) Arbitrarily give up after a maximum of 5 skips.
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Acked-by: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.
Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared. We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These have been incorrect since the function was introduced.
A proper kerneldoc comment is added since this function, though
static, is part of an external interface.
Reported-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The common practice is to name function instances the same as the
method names, but with a uniquifying prefix. Commit aef9583b23
("NFSD: Get reference of lockowner when coping file_lock") missed
this -- the new function names should both have been of the form
"nfsd4_lm_*".
Before more lock manager operations are added in NFSD, rename these
two functions for consistency.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Eventually support for NFSv2 in the Linux NFS server is to be
deprecated and then removed.
However, NFSv2 is the "always supported" version that is available
as soon as CONFIG_NFSD is set. Before NFSv2 support can be removed,
we need to choose a different "always supported" version.
This patch removes CONFIG_NFSD_V3 so that NFSv3 is always supported,
as NFSv2 is today. When NFSv2 support is removed, NFSv3 will become
the only "always supported" NFS version.
The defconfigs still need to be updated to remove CONFIG_NFSD_V3=y.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.
Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are a small handful of reasons besides pending signals that the
kernel might want to break out of interruptible sleeps. The flag
TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL
provide that the infrastructure for breaking out of interruptible
sleeps and entering the return to user space slow path for those
cases.
Expand tracehook_notify_signal inline in it's callers and remove it,
which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate
concepts.
Update the comment on set_notify_signal to more accurately describe
it's purpose.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-9-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
By default, io_uring will stop submitting a batch of requests if we run
into an error submitting a request. This isn't strictly necessary, as
the error result is passed out-of-band via a CQE anyway. And it can be
a bit confusing for some applications.
Provide a way to setup a ring that will continue submitting on error,
when the error CQE has been posted.
There's still one case that will break out of submission. If we fail
allocating a request, then we'll still return -ENOMEM. We could in theory
post a CQE for that condition too even if we never got a request. Leave
that for a potential followup.
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wrap the test of task->task_works in a helper function to make
it clear what is being tested.
All of the other readers of task->task_work use READ_ONCE and this is
even necessary on current as other processes can update
task->task_work. So for consistency I have added READ_ONCE into
task_work_pending.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If we are using provided buffers, it's less than useful to have a buffer
selected and pinned if a request needs to go async or arms poll for
notification trigger on when we can process it.
Recycle the buffer in those events, so we don't pin it for the duration
of the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most of the logic in io_read() deals with regular files, and in some ways
it would make sense to split the handling into S_IFREG and others. But
at least for retry, we don't need to bother setting up a bunch of state
just to abort in the loop later. In particular, don't bother forcing
setup of async data for a normal non-vectored read when we don't need it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One of the key architectual tenets is to keep the parameters for
io-uring stable. After the call has been submitted, its value can
be changed. Unfortunaltely this is not the case for the current statx
implementation.
IO-Uring change:
This changes replaces the const char * filename pointer in the io_statx
structure with a struct filename *. In addition it also creates the
filename object during the prepare phase.
With this change, the opcode also needs to invoke cleanup, so the
filename object gets freed after processing the request.
fs change:
This replaces the const char* __user filename parameter in the two
functions do_statx and vfs_statx with a struct filename *. In addition
to be able to correctly construct a filename object a new helper
function getname_statx_lookup_flags is introduced. The function makes
sure that do_statx and vfs_statx is invoked with the correct lookup flags.
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sqpoll thread can be used for performing the napi busy poll in a
similar way that it does io polling for file systems supporting direct
access bypassing the page cache.
The other way that io_uring can be used for napi busy poll is by
calling io_uring_enter() to get events.
If the user specify a timeout value, it is distributed between polling
and sleeping by using the systemwide setting
/proc/sys/net/core/busy_poll.
The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping
and the result is:
Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us
With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us
Co-developed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
another ring. That allows either waking up someone waiting on the ring,
or even passing a 64-bit value via the user_data field in the CQE.
sqe->fd must contain the fd of a ring that should receive the CQE.
sqe->off will be propagated to the cqe->user_data on the target ring,
and sqe->len will be propagated to cqe->res. The results CQE will have
IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
from a messaging request rather than a SQE issued locally on that ring.
This effectively allows passing a 64-bit and a 32-bit quantify between
the two rings.
This request type has the following request specific error cases:
- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target
ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In testing high frequency workloads with provided buffers, we spend a
lot of time in allocating and freeing the buffer units themselves.
Rather than repeatedly free and alloc them, add a recycling cache
instead. There are two caches:
- ctx->io_buffers_cache. This is the one we grab from in the submission
path, and it's protected by ctx->uring_lock. For inline completions,
we can recycle straight back to this cache and not need any extra
locking.
- ctx->io_buffers_comp. If we're not under uring_lock, then we use this
list to recycle buffers. It's protected by the completion_lock.
On adding a new buffer, check io_buffers_cache. If it's empty, check if
we can splice entries from the io_buffers_comp_cache.
This reduces about 5-10% of overhead from provided buffers, bringing it
pretty close to the non-provided path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.
Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
to register the ring fds themselves:
1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_src_update
structs, and unregisters them.
When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.
In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (eg
number of requests submitted per io_uring_enter(2) invocation) is low.
In a microbenchmark, submitting NOP requests, we see the following
increases in performance:
Requests per syscall Baseline Registered Increase
----------------------------------------------------------------
1 ~7030K ~8080K +15%
2 ~13120K ~14800K +13%
4 ~22740K ~25300K +11%
Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.
It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write
vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker and so the
offset might be read multiple times before being executed once.
Ensures that the file position in a set of _linked_ SQEs will be only
obtained after earlier SQEs have completed, and so will include their
incremented file position.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.
Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter
After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This introduces the __fill_cqe function. This is necessary
to correctly issue the io_uring_complete tracepoint.
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wqe->lock is abused, it now protects acct->work_list, hash stuff,
nr_workers, wqe->free_list and so on. Lets first get the work_list out
of the wqe-lock mess by introduce a specific lock for work list. This
is the first step to solve the huge contension between work insertion
and work consumption.
good thing:
- split locking for bound and unbound work list
- reduce contension between work_list visit and (worker's)free_list.
For the hash stuff, since there won't be a work with same file in both
bound and unbound work list, thus they won't visit same hash entry. it
works well to use the new lock to protect hash stuff.
Results:
set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead Command Shared Object Symbol
28.59% iou-wrk-10021 [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
8.89% io_uring_echo_s [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
6.20% iou-wrk-10021 [kernel.vmlinux] [k] _raw_spin_lock
2.45% io_uring_echo_s [kernel.vmlinux] [k] io_prep_async_work
2.36% iou-wrk-10021 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
2.29% iou-wrk-10021 [kernel.vmlinux] [k] io_worker_handle_work
1.29% io_uring_echo_s [kernel.vmlinux] [k] io_wqe_enqueue
1.06% iou-wrk-10021 [kernel.vmlinux] [k] io_wqe_worker
1.06% io_uring_echo_s [kernel.vmlinux] [k] _raw_spin_lock
1.03% iou-wrk-10021 [kernel.vmlinux] [k] __schedule
0.99% iou-wrk-10021 [kernel.vmlinux] [k] tcp_sendmsg_locked
with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead Command Shared Object Symbol
16.86% iou-wrk-10893 [kernel.vmlinux] [k] native_queued_spin_lock_slowpat
9.10% iou-wrk-10893 [kernel.vmlinux] [k] _raw_spin_lock
4.53% io_uring_echo_s [kernel.vmlinux] [k] native_queued_spin_lock_slowpat
2.87% iou-wrk-10893 [kernel.vmlinux] [k] io_worker_handle_work
2.57% iou-wrk-10893 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
2.56% io_uring_echo_s [kernel.vmlinux] [k] io_prep_async_work
1.82% io_uring_echo_s [kernel.vmlinux] [k] _raw_spin_lock
1.33% iou-wrk-10893 [kernel.vmlinux] [k] io_wqe_worker
1.26% io_uring_echo_s [kernel.vmlinux] [k] try_to_wake_up
spin_lock failure from 25.59% + 8.89% = 34.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while cpu usage is from almost 400% to 350%
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clang warns:
fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
return ret;
^~~
fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
int fd, ret;
^
= 0
1 warning generated.
Just return 0 directly and reduce the scope of ret to the if statement,
as that is the only place that it is used, which is how the function was
before the fixes commit.
Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
Link: https://github.com/ClangBuiltLinux/linux/issues/1579
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd which is RCU protected hence avoiding
ring quiesce which is much more expensive than an RCU lock. The place
where eventfd_async is read is already under rcu_read_lock so there is no
extra RCU read-side critical section needed.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.
The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.
The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.
With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
to 15ms before with ring quiesce.
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The information on whether eventfd is registered is not very useful and
would result in the tracepoint being enclosed in an rcu_readlock in a
later patch that tries to avoid ring quiesce for registering eventfd.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
F2FS_FITS_IN_INODE only cares the type of f2fs inode, so there
is no need to read node page of f2fs inode.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are a few places where we test the current process' capability set
to decide if we're going to be more or less generous with resource
acquisition for a system call. If the process doesn't have the
capability, we can continue the call, albeit in a degraded mode.
These are /not/ the actual security decisions, so it's not proper to use
capable(), which (in certain selinux setups) causes audit messages to
get logged. Switch them to has_capability_noaudit.
Fixes: 7317a03df7 ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom")
Fixes: ea9a46e1c4 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
COW extents are already converted into written real extents after
xfs_reflink_convert_cow_locked(), therefore cmap->br_state should
reflect it.
Otherwise, there is another necessary unwritten convertion
triggered in xfs_dio_write_end_io() for direct I/O cases.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Recently the kernel test robot reported:
> In file included from include/linux/kernel.h:29,
> from fs/binfmt_flat.c:21:
> fs/binfmt_flat.c: In function 'flat_core_dump':
> >> fs/binfmt_flat.c:121:50: error: invalid use of undefined type 'struct coredump_params'
> 121 | current->comm, current->pid, cprm->siginfo->si_signo);
> | ^~
> include/linux/printk.h:418:33: note: in definition of macro 'printk_index_wrap'
> 418 | _p_func(_fmt, ##__VA_ARGS__); \
> | ^~~~~~~~~~~
> include/linux/printk.h:499:9: note: in expansion of macro 'printk'
> 499 | printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
> | ^~~~~~
> fs/binfmt_flat.c:120:9: note: in expansion of macro 'pr_warn'
> 120 | pr_warn("Process %s:%d received signr %d and should have core dumped\n",
> | ^~~~~~~
> At top level:
> fs/binfmt_flat.c:118:12: warning: 'flat_core_dump' defined but not used [-Wunused-function]
> 118 | static int flat_core_dump(struct coredump_params *cprm)
> | ^~~~~~~~~~~~~~
The little dinky do nothing function flat_core_dump has always been
compiled unconditionally. With my change to move coredump_params into
coredump.h coredump_params reasonably becomes unavailable when
coredump support is not compiled in. Fix this old issue by simply not
compiling flat_core_dump when coredump support is not supported.
Fixes: a99a3e2efa ("coredump: Move definition of struct coredump_params into coredump.h")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Refactor block I/O code so that the bio operation and known flags are set
at bio allocation time. Only the later updated flags are updated on the
fly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220228124123.856027-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set the bdev at bio allocation time by changing the f2fs_target_device
calling conventions, so that no bio needs to be passed in.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220228124123.856027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The value is now completely unused except for reporting it back through
the F_GET_FILE_RW_HINT ioctl, so remove the value and the two ioctls
for it.
Trying to use the F_SET_FILE_RW_HINT and F_GET_FILE_RW_HINT fcntls will
now return EINVAL, just like it would on a kernel that never supported
this functionality in the first place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This field is entirely unused now except for a tracepoint in f2fs, so
remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matthew Wilcox reported that there is a missing mmap_lock in
file_files_note that could possibly lead to a user after free.
Solve this by using the existing vma snapshot for consistency
and to avoid the need to take the mmap_lock anywhere in the
coredump code except for dump_vma_snapshot.
Update the dump_vma_snapshot to capture vm_pgoff and vm_file
that are neeeded by fill_files_note.
Add free_vma_snapshot to free the captured values of vm_file.
Reported-by: Matthew Wilcox <willy@infradead.org>
Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org
Cc: stable@vger.kernel.org
Fixes: a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: 2aa362c49c ("coredump: extend core dump note section to contain file names of mapped files")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Instead of individually passing cprm->siginfo and cprm->regs
into fill_note_info pass all of struct coredump_params.
This is preparation to allow fill_files_note to use the existing
vma snapshot.
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The condition is impossible and to the best of my knowledge has never
triggered.
We are in deep trouble if that conditions happens and we walk past
the end of our allocated array.
So delete the WARN_ON and the code that makes it look like the kernel
can handle the case of walking past the end of it's vma_meta array.
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the
individual coredump routines into do_coredump itself. This makes
the code less error prone and easier to maintain.
Make the vma snapshot available to the coredump routines
in struct coredump_params. This makes it easier to
change and update what is captures in the vma snapshot
and will be needed for fixing fill_file_notes.
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Move the definition of struct coredump_params into coredump.h where
it belongs.
Remove the slightly errorneous comment explaining why struct
coredump_params was declared in binfmts.h.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYidm9QAKCRDh3BK/laaZ
PAWVAQCMbTjDZzjk3jPVaCR9xS4mhDfjnQGH6FJmykpheGy59gD6AvdDAtwYsiRq
alvwpDyeI6YVueBKst1joFFcnzKEPAU=
=sxAr
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
- Fix an issue with splice on the fuse device
- Fix a regression in the fileattr API conversion
- Add a small userspace API improvement
* tag 'fuse-fixes-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix pipe buffer lifetime for direct_io
fuse: move FUSE_SUPER_MAGIC definition to magic.h
fuse: fix fileattr op failure
Variable etype is being assigned a value that is never read. The
variable and assignment are redundant and can be removed.
Cleans up clang scan build warning:
fs/udf/super.c:2485:10: warning: Although the value stored to 'etype'
is used in the enclosing expression, the value is never actually read
from 'etype' [deadcode.DeadStores]
Link: https://lore.kernel.org/r/20220307152149.139045-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.
On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.
This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.
Fix by copying pages coming from the user address space to new pipe
buffers.
Reported-by: Jann Horn <jannh@google.com>
Fixes: c3021629a0 ("fuse: support splice() reading from fuse device")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the %pg format specifier to save on stack consuption and code size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220304180105.409765-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmIk1isACgkQxWXV+ddt
WDsAVQ//TvkKObQLL/BJ4TFSxr1ZLs83z4vTcss2W/MrMjGWUut1fhUTGlhkqgC6
RE03VBuUV983k09/Tn3Q0AHSXcMAmxEv/t1QweJNKiVv7YKT3Nj7VF3kHioFz9g/
gZ5q9FVbTXkrl4tgcwiQXbLJ1BLWBfXTAMatKgsIQBYsYg0ec3GGem/tx3OlvdNt
9My6EJhNo5X7vrTMjRUygDgHDhcAgp/gYMa2VmnPhK5qcPzmIYbt4CJGLQDwiiiB
KSsXnsHCXKm/BRPgtgnMBH6O8YruaxUn0nEQMjntGx8RHbZrkdXk90PaK7pmWz1W
KkbHTM98zclAOWUx6JmGw8mb9aZQo6aGpou2Pa98aBtHhvbhiKYS2W2OOnHbAshK
2bj6W2o89eYHKgX+5fICWHt7efUoWUh1KPC+TeaV8DKl8q0a9DC3KfIL/v7PZacA
pIyyy4uyXh3finzI+Q+fW7QVKQhpcQKLuq5EVGCMEotlfsn+SJBselAdwUl9ChUp
ALAiYn1T8W1Mrt8P2mxB29btGrdckHtpoWTgr++OAZaX4PABF3GAvIxXwmFg2aMK
zfXKwTxjwKM42H3AWaLHttk4OA7FJhY9sgOproON/3Tn9cBSK2jiO0HSk1dBn/dL
WQbOKh4Z+VDXi5niF8hmTANTNO0wS0JdiKZX86tYyhcCl0ZBr/w=
=Bd5z
-----END PGP SIGNATURE-----
Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for various problems that have user visible effects
or seem to be urgent:
- fix corruption when combining DIO and non-blocking io_uring over
multiple extents (seen on MariaDB)
- fix relocation crash due to premature return from commit
- fix quota deadlock between rescan and qgroup removal
- fix item data bounds checks in tree-checker (found on a fuzzed
image)
- fix fsync of prealloc extents after EOF
- add missing run of delayed items after unlink during log replay
- don't start relocation until snapshot drop is finished
- fix reversed condition for subpage writers locking
- fix warning on page error"
* tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fallback to blocking mode when doing async dio over multiple extents
btrfs: add missing run of delayed items after unlink during log replay
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
btrfs: do not start relocation until in progress drops are done
btrfs: tree-checker: use u64 for item data end to avoid overflow
btrfs: do not WARN_ON() if we have PageError set
btrfs: fix lost prealloc extents beyond eof after full fsync
btrfs: subpage: fix a wrong check on subpage->writers
Since bit 57 was exported for uffd-wp write-protected (commit
fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"),
fixing it can reduce some unnecessary confusion.
Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
Fixes: fb8e37f35a ("mm/pagemap: export uffd-wp protection information")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Colin Cross <ccross@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unfair rwsem should be used when blk-cg is on. Otherwise, there is regression.
FYI, we noticed a -26.7% regression of aim7.jobs-per-min due to commit:
commit: e4544b63a7 ("f2fs: move f2fs to use reader-unfair rwsems")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
in testcase: aim7
on test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
with following parameters:
disk: 4BRD_12G
md: RAID0
fs: f2fs
test: sync_disk_rw
load: 100
cpufreq_governor: performance
ucode: 0x500320a
test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit 51bd9563b6 ("btrfs: fix deadlock due to page faults
during direct IO reads and writes"). That changed btrfs to use the new
iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
corresponds to a memory mapped file region. That type of scenario is
exercised by test case generic/647 from fstests.
For this MariaDB scenario, we attempt to read 16K from file offset X
using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
with a size of 4K, and what happens is the following:
1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
2) iomap creates a struct iomap_dio object, its reference count is
initialized to 1 and its ->size field is initialized to 0;
3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
the first 4K extent, and setups an iomap for this extent consisting
of a single page;
4) At iomap_dio_bio_iter(), we are able to access the first page of the
buffer (struct iov_iter) with bio_iov_iter_get_pages() without
triggering a page fault;
5) iomap submits a bio for this 4K extent
(iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
the refcount on the struct iomap_dio object to 2; The ->size field
of the struct iomap_dio object is incremented to 4K;
6) iomap calls btrfs_iomap_begin() again, this time with a file
offset of X + 4K. There we setup an iomap for the next extent
that also has a size of 4K;
7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
which tries to access the next page (2nd page) of the buffer.
This triggers a page fault and returns -EFAULT;
8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
the struct iomap_dio object has a ->size value of 4K (we submitted
a bio for an extent already). The 'wait_for_completion' variable
is not set to true, because our iocb has IOCB_NOWAIT set;
9) At the bottom of __iomap_dio_rw(), we decrement the reference count
of the struct iomap_dio object from 2 to 1. Because we were not
the only ones holding a reference on it and 'wait_for_completion' is
set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
just returns it up the callchain, up to io_uring;
10) The bio submitted for the first extent (step 5) completes and its
bio endio function, iomap_dio_bio_end_io(), decrements the last
reference on the struct iomap_dio object, resulting in calling
iomap_dio_complete_work() -> iomap_dio_complete().
11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
and return 4K (the amount of io done) to iomap_dio_complete_work();
12) iomap_dio_complete_work() calls the iocb completion callback,
iocb->ki_complete() with a second argument value of 4K (total io
done) and the iocb with the adjust ki_pos of X + 4K. This results
in completing the read request for io_uring, leaving it with a
result of 4K bytes read, and only the first page of the buffer
filled in, while the remaining 3 pages, corresponding to the other
3 extents, were not filled;
13) For the application, the result is unexpected because if we ask
to read N bytes, it expects to get N bytes read as long as those
N bytes don't cross the EOF (i_size).
MariaDB reports this as an error, as it's not expecting a short read,
since it knows it's asking for read operations fully within the i_size
boundary. This is typical in many applications, but it may also be
questionable if they should react to such short reads by issuing more
read calls to get the remaining data. Nevertheless, the short read
happened due to a change in btrfs regarding how it deals with page
faults while in the middle of a read operation, and there's no reason
why btrfs can't have the previous behaviour of returning the whole data
that was requested by the application.
The problem can also be triggered with the following simple program:
/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
int main(int argc, char *argv[])
{
char *foo_path;
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
struct iovec iovec;
int fd;
long pagesize;
void *write_buf;
void *read_buf;
ssize_t ret;
int i;
if (argc != 2) {
fprintf(stderr, "Use: %s <directory>\n", argv[0]);
return 1;
}
foo_path = malloc(strlen(argv[1]) + 5);
if (!foo_path) {
fprintf(stderr, "Failed to allocate memory for file path\n");
return 1;
}
strcpy(foo_path, argv[1]);
strcat(foo_path, "/foo");
/*
* Create file foo with 2 extents, each with a size matching
* the page size. Then allocate a buffer to read both extents
* with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
* the read with io_uring, access the first page of the buffer
* to fault it in, so that during the read we only trigger a
* page fault when accessing the second page of the buffer.
*/
fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
O_DIRECT, 0666);
if (fd == -1) {
fprintf(stderr,
"Failed to create file 'foo': %s (errno %d)",
strerror(errno), errno);
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate write buffer\n");
return 1;
}
memset(write_buf, 0xab, pagesize);
memset(write_buf + pagesize, 0xcd, pagesize);
/* Create 2 extents, each with a size matching page size. */
for (i = 0; i < 2; i++) {
ret = pwrite(fd, write_buf + i * pagesize, pagesize,
i * pagesize);
if (ret != pagesize) {
fprintf(stderr,
"Failed to write to file, ret = %ld errno %d (%s)\n",
ret, errno, strerror(errno));
return 1;
}
ret = fsync(fd);
if (ret != 0) {
fprintf(stderr, "Failed to fsync file\n");
return 1;
}
}
close(fd);
fd = open(foo_path, O_RDONLY | O_DIRECT);
if (fd == -1) {
fprintf(stderr,
"Failed to open file 'foo': %s (errno %d)",
strerror(errno), errno);
return 1;
}
ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate read buffer\n");
return 1;
}
/*
* Fault in only the first page of the read buffer.
* We want to trigger a page fault for the 2nd page of the
* read buffer during the read operation with io_uring
* (O_DIRECT and IOCB_NOWAIT).
*/
memset(read_buf, 0, 1);
ret = io_uring_queue_init(1, &ring, 0);
if (ret != 0) {
fprintf(stderr, "Failed to create io_uring queue\n");
return 1;
}
sqe = io_uring_get_sqe(&ring);
if (!sqe) {
fprintf(stderr, "Failed to get io_uring sqe\n");
return 1;
}
iovec.iov_base = read_buf;
iovec.iov_len = 2 * pagesize;
io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
ret = io_uring_submit_and_wait(&ring, 1);
if (ret != 1) {
fprintf(stderr,
"Failed at io_uring_submit_and_wait()\n");
return 1;
}
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
return 1;
}
printf("io_uring read result for file foo:\n\n");
printf(" cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n",
memcmp(read_buf, write_buf, 2 * pagesize));
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);
return 0;
}
When running it on an unpatched kernel:
$ gcc io_uring_test.c -luring
$ mkfs.btrfs -f /dev/sda
$ mount /dev/sda /mnt/sda
$ ./a.out /mnt/sda
io_uring read result for file foo:
cqe->res == 4096 (expected 8192)
memcmp(read_buf, write_buf) == -205 (expected 0)
After this patch, the read always returns 8192 bytes, with the buffer
filled with the correct data. Although that reproducer always triggers
the bug in my test vms, it's possible that it will not be so reliable
on other environments, as that can happen if the bio for the first
extent completes and decrements the reference on the struct iomap_dio
object before we do the atomic_dec_and_test() on the reference at
__iomap_dio_rw().
Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
set) over a range that spans multiple extents (or a mix of extents and
holes). This avoids returning success to the caller when we only did
partial IO, which is not optimal for writes and for reads it's actually
incorrect, as the caller doesn't expect to get less bytes read than it has
requested (unless EOF is crossed), as previously mentioned. This is also
the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
even though it doesn't use IOMAP_DIO_PARTIAL.
A test case for fstests will follow soon.
Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Quoted from Jing Xia's report, there is a potential deadlock may happen
between kworker and checkpoint as below:
[T:writeback] [T:checkpoint]
- wb_writeback
- blk_start_plug
bio contains NodeA was plugged in writeback threads
- do_writepages -- sync write inodeB, inc wb_sync_req[DATA]
- f2fs_write_data_pages
- f2fs_write_single_data_page -- write last dirty page
- f2fs_do_write_data_page
- set_page_writeback -- clear page dirty flag and
PAGECACHE_TAG_DIRTY tag in radix tree
- f2fs_outplace_write_data
- f2fs_update_data_blkaddr
- f2fs_wait_on_page_writeback -- wait NodeA to writeback here
- inode_dec_dirty_pages
- writeback_sb_inodes
- writeback_single_inode
- do_writepages
- f2fs_write_data_pages -- skip writepages due to wb_sync_req[DATA]
- wbc->pages_skipped += get_dirty_pages() -- PAGECACHE_TAG_DIRTY is not set but get_dirty_pages() returns one
- requeue_inode -- requeue inode to wb->b_dirty queue due to non-zero.pages_skipped
- blk_finish_plug
Let's try to avoid deadlock condition by forcing unplugging previous bio via
blk_finish_plug(current->plug) once we'v skipped writeback in writepages()
due to valid sbi->wb_sync_req[DATA/NODE].
Fixes: 687de7f101 ("f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When cachefiles_shorten_object() calls fallocate() to shape the cache
file to match the DIO size, it passes the total file size it wants to
achieve, not the amount of zeros that should be inserted. Since this is
meant to preallocate that amount of storage for the file, it can cause
the cache to fill up the disk and hit ENOSPC.
Fix this by passing the length actually required to go from the current
EOF to the desired EOF.
Fixes: 7623ed6772 ("cachefiles: Implement cookie resize for truncate")
Reported-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/164630854858.3665356.17419701804248490708.stgit@warthog.procyon.org.uk # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.
Link: https://lkml.kernel.org/r/20220210072828.2930359-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.
Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1. This is a separate issue and will
be sorted out later. Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.
Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>.
Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christian Knig <christian.koenig@amd.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.
Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance. A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].
This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).
So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed. More discussion and background can be found in the
thread starting at [2].
[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
The return value from function gfs2_indirect_init is never used, so
remove it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Variable split_flag1 is being assigned a value that is never read,
it is being re-assigned a new value in the following code block.
The assignment is redundant and can be removed.
Cleans up clang scan build warning:
fs/ext4/extents.c:3371:2: warning: Value stored to 'split_flag1' is
never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20220301121644.997833-1-colin.i.king@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
when ext4 filesystem is created with 64k block size, ^extent and
^huge_file features. the upper_limit would underflow during the
computations in ext4_max_bitmap_size(). The problem is the size of block
index tree for such large block size is more than i_blocks can carry.
So fix the computation to count with this possibility. After this fix,
the 'res' cannot overflow loff_t on the extreme case of filesystem with
huge_files and 64K block size, so this patch also revert commit
75ca6ad408 ("ext4: fix loff_t overflow in ext4_max_bitmap_size()").
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220301111704.2153829-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove the excess description of @bh in ext4_mb_clear_bb() kernel-doc
comment to remove warnings found by running scripts/kernel-doc, which
is caused by using 'make W=1'.
fs/ext4/mballoc.c:5895: warning: Excess function parameter 'bh'
description in 'ext4_mb_clear_bb'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220301092136.34764-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
[ 110.920551] ext4_empty_dir: inject fault
[ 110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '..' in .../??? (13) has deleted/unused inode 12. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13 (...)
Connect to /lost+found<y>? yes
Pass 4: Checking reference counts
Inode 13 ref count is 3, should be 2. Fix<y>? yes
Pass 5: Checking group summary information
/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks
ext4_rmdir
if (!ext4_empty_dir(inode))
goto end_rmdir;
ext4_empty_dir
bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
if (IS_ERR(bh))
return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Fix ztailpacking z_idataoff getting trimmed on > 4GiB filesystems.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYh98dxEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBJ5NAQDb7o+5L9lnmxyVnx4UvPnj4k0JKnNEqO7Q
iCuVn31XQAD/Six/UARs13Tbvinvoigb1CWxSrLhh3QDumWwDl8l7Ak=
=906m
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.17-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
"A one-line patch to fix the new ztailpacking feature on > 4GiB
filesystems because z_idataoff can get trimmed improperly.
ztailpacking is still a brand new EXPERIMENTAL feature, but it'd be
better to fix the issue as soon as possible to avoid unnecessary
backporting.
Summary:
- Fix ztailpacking z_idataoff getting trimmed on > 4GiB filesystems"
* tag 'erofs-for-5.17-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix ztailpacking on > 4GiB filesystems
Reiserfs is relatively old filesystem and its development has ceased
quite some years ago. Linux distributions moved away from it towards
other filesystems such as btrfs, xfs, or ext4. To reduce maintenance
burden on cross filesystem changes (such as new mount API, iomap, folios
...) let's add a deprecation notice when the filesystem is mounted and
schedule its removal to 2025.
Link: https://lore.kernel.org/r/20220225125445.29942-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
During log replay, whenever we need to check if a name (dentry) exists in
a directory we do searches on the subvolume tree for inode references or
or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
keys as well, before kernel 5.17). However when during log replay we
unlink a name, through btrfs_unlink_inode(), we may not delete inode
references and dir index keys from a subvolume tree and instead just add
the deletions to the delayed inode's delayed items, which will only be
run when we commit the transaction used for log replay. This means that
after an unlink operation during log replay, if we attempt to search for
the same name during log replay, we will not see that the name was already
deleted, since the deletion is recorded only on the delayed items.
We run delayed items after every unlink operation during log replay,
except at unlink_old_inode_refs() and at add_inode_ref(). This was due
to an overlook, as delayed items should be run after evert unlink, for
the reasons stated above.
So fix those two cases.
Fixes: 0d836392ca ("Btrfs: fix mount failure after fsync due to hard link recreation")
Fixes: 1f250e929a ("Btrfs: fix log replay failure after unlink and link combination")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit e804861bd4 ("btrfs: fix deadlock between quota disable and
qgroup rescan worker") by Kawasaki resolves deadlock between quota
disable and qgroup rescan worker. But also there is a deadlock case like
it. It's about enabling or disabling quota and creating or removing
qgroup. It can be reproduced in simple script below.
for i in {1..100}
do
btrfs quota enable /mnt &
btrfs qgroup create 1/0 /mnt &
btrfs qgroup destroy 1/0 /mnt &
btrfs quota disable /mnt &
done
Here's why the deadlock happens:
1) The quota rescan task is running.
2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
the quota rescan task to complete.
3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
the qgroup_ioctl_lock mutex, because it's being held by task A. At that
point task B is holding a transaction handle for the current transaction.
4) The quota rescan task calls btrfs_commit_transaction(). This results
in it waiting for all other tasks to release their handles on the
transaction, but task B is blocked on the qgroup_ioctl_lock mutex
while holding a handle on the transaction, and that mutex is being held
by task A, which is waiting for the quota rescan task to complete,
resulting in a deadlock between these 3 tasks.
To resolve this issue, the thread disabling quota should unlock
qgroup_ioctl_lock before waiting rescan completion. Move
btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock.
Fixes: e804861bd4 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit a bug with a recovering relocation on mount for one of our file
systems in production. I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time. This
presented as an error while looking up an extent item
WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
RIP: 0010:lookup_inline_extent_backref+0x647/0x680
RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
Call Trace:
<TASK>
insert_inline_extent_backref+0x46/0xd0
__btrfs_inc_extent_ref.isra.0+0x5f/0x200
? btrfs_merge_delayed_refs+0x164/0x190
__btrfs_run_delayed_refs+0x561/0xfa0
? btrfs_search_slot+0x7b4/0xb30
? btrfs_update_root+0x1a9/0x2c0
btrfs_run_delayed_refs+0x73/0x1f0
? btrfs_update_root+0x1a9/0x2c0
btrfs_commit_transaction+0x50/0xa50
? btrfs_update_reloc_root+0x122/0x220
prepare_to_merge+0x29f/0x320
relocate_block_group+0x2b8/0x550
btrfs_relocate_block_group+0x1a6/0x350
btrfs_relocate_chunk+0x27/0xe0
btrfs_balance+0x777/0xe60
balance_kthread+0x35/0x50
? btrfs_balance+0xe60/0xe60
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex. However if we had a
pending balance waiting to get the ->cleaner_mutex, and a snapshot
deletion was running, and then the box crashed, we would come up in a
state where we have a half deleted snapshot.
Again, in the normal case the snapshot deletion needs to complete before
relocation can start, but in this case relocation could very well start
before the snapshot deletion completes, as we simply add the root to the
dead roots list and wait for the next time the cleaner runs to clean up
the snapshot.
Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key. If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info. Then
balance can wait until this flag is cleared to start up again.
If there are DEAD_ROOT's that don't have a drop_progress set then we're
safe to start balance right away as we'll be properly protected by the
cleaner_mutex.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
User reported there is an array-index-out-of-bounds access while
mounting the crafted image:
[350.411942 ] loop0: detected capacity change from 0 to 262144
[350.427058 ] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (1044)
[350.428564 ] BTRFS info (device loop0): disk space caching is enabled
[350.428568 ] BTRFS info (device loop0): has skinny extents
[350.429589 ]
[350.429619 ] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
[350.429636 ] index 1048096 is out of range for type 'page *[16]'
[350.429650 ] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4
[350.429652 ] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[350.429653 ] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
[350.429772 ] Call Trace:
[350.429774 ] <TASK>
[350.429776 ] dump_stack_lvl+0x47/0x5c
[350.429780 ] ubsan_epilogue+0x5/0x50
[350.429786 ] __ubsan_handle_out_of_bounds+0x66/0x70
[350.429791 ] btrfs_get_16+0xfd/0x120 [btrfs]
[350.429832 ] check_leaf+0x754/0x1a40 [btrfs]
[350.429874 ] ? filemap_read+0x34a/0x390
[350.429878 ] ? load_balance+0x175/0xfc0
[350.429881 ] validate_extent_buffer+0x244/0x310 [btrfs]
[350.429911 ] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
[350.429935 ] end_bio_extent_readpage+0x3af/0x850 [btrfs]
[350.429969 ] ? newidle_balance+0x259/0x480
[350.429972 ] end_workqueue_fn+0x29/0x40 [btrfs]
[350.429995 ] btrfs_work_helper+0x71/0x330 [btrfs]
[350.430030 ] ? __schedule+0x2fb/0xa40
[350.430033 ] process_one_work+0x1f6/0x400
[350.430035 ] ? process_one_work+0x400/0x400
[350.430036 ] worker_thread+0x2d/0x3d0
[350.430037 ] ? process_one_work+0x400/0x400
[350.430038 ] kthread+0x165/0x190
[350.430041 ] ? set_kthread_struct+0x40/0x40
[350.430043 ] ret_from_fork+0x1f/0x30
[350.430047 ] </TASK>
[350.430047 ]
[350.430077 ] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
btrfs check reports:
corrupt leaf: root=3 block=20975616 physical=20975616 slot=1, unexpected
item end, have 4294971193 expect 3897
The first slot item offset is 4293005033 and the size is 1966160.
In check_leaf, we use btrfs_item_end() to check item boundary versus
extent_buffer data size. However, return type of btrfs_item_end() is u32.
(u32)(4293005033 + 1966160) == 3897, overflow happens and the result 3897
equals to leaf data size reasonably.
Fix it by use u64 variable to store item data end in check_leaf() to
avoid u32 overflow.
This commit does solve the invalid memory access showed by the stack
trace. However, its metadata profile is DUP and another copy of the
leaf is fine. So the image can be mounted successfully. But when umount
is called, the ASSERT btrfs_mark_buffer_dirty() will be triggered
because the only node in extent tree has 0 item and invalid owner. It's
solved by another commit
"btrfs: check extent buffer owner against the owner rootid".
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week
WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This was partially fixed by c2e3930529 ("btrfs: clear extent buffer
uptodate when we fail to write it"), however all that fix did was keep
us from finding extent buffers after a failed writeout. It didn't keep
us from continuing to use a buffer that we already had found.
In this case we're searching the commit root to cache the block group,
so we can start committing the transaction and switch the commit root
and then start writing. After the switch we can look up an extent
buffer that hasn't been written yet and start processing that block
group. Then we fail to write that block out and clear Uptodate on the
page, and then we start spewing these errors.
Normally we're protected by the tree lock to a certain degree here. If
we read a block we have that block read locked, and we block the writer
from locking the block before we submit it for the write. However this
isn't necessarily fool proof because the read could happen before we do
the submit_bio and after we locked and unlocked the extent buffer.
Also in this particular case we have path->skip_locking set, so that
won't save us here. We'll simply get a block that was valid when we
read it, but became invalid while we were using it.
What we really want is to catch the case where we've "read" a block but
it's not marked Uptodate. On read we ClearPageError(), so if we're
!Uptodate and !Error we know we didn't do the right thing for reading
the page.
Fix this by checking !Uptodate && !Error, this way we will not complain
if our buffer gets invalidated while we're using it, and we'll maintain
the spirit of the check which is to make sure we have a fully in-cache
block while we're messing with it.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a full fsync, if we have prealloc extents beyond (or at) eof,
and the leaves that contain them were not modified in the current
transaction, we end up not logging them. This results in losing those
extents when we replay the log after a power failure, since the inode is
truncated to the current value of the logged i_size.
Just like for the fast fsync path, we need to always log all prealloc
extents starting at or beyond i_size. The fast fsync case was fixed in
commit 471d557afe ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay") but it missed the full fsync path. The problem
exists since the very early days, when the log tree was added by
commit e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize
synchronous operations").
Example reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with many file extent items, so that they span
# several leaves of metadata, even if the node/page size is 64K. Use
# direct IO and not fsync/O_SYNC because it's both faster and it avoids
# clearing the full sync flag from the inode - we want the fsync below
# to trigger the slow full sync code path.
$ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
# Now add two preallocated extents to our file without extending the
# file's size. One right at i_size, and another further beyond, leaving
# a gap between the two prealloc extents.
$ xfs_io -c "falloc -k 16M 1M" /mnt/foo
$ xfs_io -c "falloc -k 20M 1M" /mnt/foo
# Make sure everything is durably persisted and the transaction is
# committed. This makes all created extents to have a generation lower
# than the generation of the transaction used by the next write and
# fsync.
sync
# Now overwrite only the first extent, which will result in modifying
# only the first leaf of metadata for our inode. Then fsync it. This
# fsync will use the slow code path (inode full sync bit is set) because
# it's the first fsync since the inode was created/loaded.
$ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
# Extent list before power failure.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x0
3: [32768..34815]: 2172928..2174975 2048 0x800
4: [34816..40959]: hole 6144
5: [40960..43007]: 2174976..2177023 2048 0x801
<power fail>
# Mount fs again, trigger log replay.
$ mount /dev/sdc /mnt
# Extent list after power failure and log replay.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x1
# The prealloc extents at file offsets 16M and 20M are missing.
So fix this by calling btrfs_log_prealloc_extents() when we are doing a
full fsync, so that we always log all prealloc extents beyond eof.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When looping btrfs/074 with 64K page size and 4K sectorsize, there is a
low chance (1/50~1/100) to crash with the following ASSERT() triggered
in btrfs_subpage_start_writer():
ret = atomic_add_return(nbits, &subpage->writers);
ASSERT(ret == nbits); <<< This one <<<
[CAUSE]
With more debugging output on the parameters of
btrfs_subpage_start_writer(), it shows a very concerning error:
ret=29 nbits=13 start=393216 len=53248
For @nbits it's correct, but @ret which is the returned value from
atomic_add_return(), it's not only larger than nbits, but also larger
than max sectors per page value (for 64K page size and 4K sector size,
it's 16).
This indicates that some call sites are not properly decreasing the value.
And that's exactly the case, in btrfs_page_unlock_writer(), due to the
fact that we can have page locked either by lock_page() or
process_one_page(), we have to check if the subpage has any writer.
If no writers, it's locked by lock_page() and we only need to unlock it.
But unfortunately the check for the writers are completely opposite:
if (atomic_read(&subpage->writers))
/* No writers, locked by plain lock_page() */
return unlock_page(page);
We directly unlock the page if it has writers, which is the completely
opposite what we want.
Thankfully the affected call site is only limited to
extent_write_locked_range(), so it's mostly affecting compressed write.
[FIX]
Just fix the wrong check condition to fix the bug.
Fixes: e55a0de185 ("btrfs: rework page locking in __extent_writepage()")
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
z_idataoff here is an absolute physical offset, so it should use
erofs_off_t (64 bits at least). Otherwise, it'll get trimmed and
cause the decompresion failure.
Link: https://lore.kernel.org/r/20220222033118.20540-1-hsiangkao@linux.alibaba.com
Fixes: ab92184ff8 ("erofs: add on-disk compressed tail-packing inline support")
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Even if we're not able to cache all the entries in the readdir buffer,
let's ensure that we do prime the dcache.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Avoid clearing the entire readdir page cache if we're just doing forced
readdirplus for the 'ls -l' heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Instead of using a linear index to address the pages, use the cookie of
the first entry, since that is what we use to match the page anyway.
This allows us to avoid re-reading the entire cache on a seekdir() type
of operation. The latter is very common when re-exporting NFS, and is a
major performance drain.
The change does affect our duplicate cookie detection, since we can no
longer rely on the page index as a linear offset for detecting whether
we looped backwards. However since we no longer do a linear search
through all the pages on each call to nfs_readdir(), this is less of a
concern than it was previously.
The other downside is that invalidate_mapping_pages() no longer can use
the page index to avoid clearing pages that have been read. A subsequent
patch will restore the functionality this provides to the 'ls -l'
heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add tracing to track how often the client goes to the server for updated
readdir information.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the revalidation was forced, due to the presence of a LOOKUP_EXCL or
a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help
when we're doing a path component lookup.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the filesystem is case insensitive, then readdirplus can't help with
cache misses, since it won't return case folded variants of the filename.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Instead of pretending that we know the ratio of directory info vs
readdirplus attribute info, just set the 'dircount' field to the same
value as the 'maxcount' field.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If attribute caching is turned off, then use of readdirplus is not going
to help stat() performance.
Readdirplus also doesn't help if a file is being written to, since we
will have to flush those writes in order to sync the mtime/ctime.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.
The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.
To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When reading a very large directory, we want to try to keep the page
cache up to date if doing so is inexpensive. With the change to allow
readdir to continue reading even when the cache is incomplete, we no
longer need to fall back to uncached readdir in order to scale to large
directories.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Recent changes to readdir mean that we can cope with partially filled
page cache entries, so we no longer need to rely on looping in
nfs_readdir_xdr_to_array().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that if the cookie verifier changes when we use the zero-valued
cookie, then we invalidate any cached pages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.
This patch therefore tries to tone down the amount of readahead we
perform, and adjust it to try to match the amount of data being
requested by user space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we hit the end of the data in the readdir page, we don't want to
start filling a new page, unless this one is full.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the page cache entry that was last read gets invalidated for some
reason, then make sure we can re-create it on the next call to readdir.
This, combined with the cache page validation, allows us to reuse the
cached value of page-index on successive calls to nfs_readdir.
Credit is due to Benjamin Coddington for showing that the concept works,
and that it allows for improved cache sharing between processes even in
the case where pages are lost due to LRU or active invalidation.
Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Use the change attribute and the first cookie in a directory page cache
entry to validate that the page is up to date.
Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Remove the second 'from'.
Replace 'backwords' with 'backwards'.
Replace 'visibile' with 'visible'.
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220211160940.2516243-1-trix@redhat.com
I delete load_addr because it is not used anymore. And I rename
load_addr_set to first_pt_load because it is used only to capture the
first iteration of the loop.
Signed-off-by: Akira Kawata <akirakawata1@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127124014.338760-3-akirakawata1@gmail.com
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=197921
As pointed out in the discussion of buglink, we cannot calculate AT_PHDR
as the sum of load_addr and exec->e_phoff.
: The AT_PHDR of ELF auxiliary vectors should point to the memory address
: of program header. But binfmt_elf.c calculates this address as follows:
:
: NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
:
: which is wrong since e_phoff is the file offset of program header and
: load_addr is the memory base address from PT_LOAD entry.
:
: The ld.so uses AT_PHDR as the memory address of program header. In normal
: case, since the e_phoff is usually 64 and in the first PT_LOAD region, it
: is the correct program header address.
:
: But if the address of program header isn't equal to the first PT_LOAD
: address + e_phoff (e.g. Put the program header in other non-consecutive
: PT_LOAD region), ld.so will try to read program header from wrong address
: then crash or use incorrect program header.
This is because exec->e_phoff
is the offset of PHDRs in the file and the address of PHDRs in the
memory may differ from it. This patch fixes the bug by calculating the
address of program headers from PT_LOADs directly.
Signed-off-by: Akira Kawata <akirakawata1@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127124014.338760-2-akirakawata1@gmail.com
struct linux_binfmt::core_dump and struct min_coredump::min_coredump
are used under CONFIG_COREDUMP only. Shrink those embedded configs
a bit.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/YglbIFyN+OtwVyjW@localhost.localdomain
Quoting[1] Ariadne Conill:
"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:
The argument arg0 should point to a filename string that is
associated with the process being started by one of the exec
functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.
This issue is being tracked in the KSPP issue tracker[5]."
While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to that just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.
The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.
Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:
process './argc0' launched './argc0' with NULL argv: empty string added
Additionally WARN() and reject NULL argv usage for kernel threads.
[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
When I rewrote the VMA dumping logic for coredumps, I changed it to
recognize ELF library mappings based on the file being executable instead
of the mapping having an ELF header. But turns out, distros ship many ELF
libraries as non-executable, so the heuristic goes wrong...
Restore the old behavior where FILTER(ELF_HEADERS) dumps the first page of
any offset-0 readable mapping that starts with the ELF magic.
This fix is technically layer-breaking a bit, because it checks for
something ELF-specific in fs/coredump.c; but since we probably want to
share this between standard ELF and FDPIC ELF anyway, I guess it's fine?
And this also keeps the change small for backporting.
Cc: stable@vger.kernel.org
Fixes: 429a22e776 ("coredump: rework elf/elf_fdpic vma_dump_size() into common helper")
Reported-by: Bill Messmer <wmessmer@microsoft.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220126025739.2014888-1-jannh@google.com
Kernel assumes that ELF program headers are ordered by mapping address,
but doesn't enforce it. It is possible to make mapping size extremely huge
by simply shuffling first and last PT_LOAD segments.
As long as PT_LOAD segments do not overlap, it is silly to require
sorting by v_addr anyway because mmap() doesn't care.
Don't assume PT_LOAD segments are sorted and calculate min and max
addresses correctly.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Tested-by: "Magnus Groß" <magnus.gross@rwth-aachen.de>
Link: https://lore.kernel.org/all/Yfqm7HbucDjPbES+@fractal.localdomain/
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/lkml/YVmd7D0M6G%2FDcP4O@localhost.localdomain