Pull vfs fixes from Al Viro:
"Assorted fixes all over the place.
The iov_iter one is this cycle regression (splice from UDP triggering
WARN_ON()), the rest is older"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Use d_instantiate() rather than d_add() and don't d_drop()
afs: Fix missing net error handling
afs: Fix validation/callback interaction
iov_iter: teach csum_and_copy_to_iter() to handle pipe-backed ones
exportfs: do not read dentry after free
exportfs: fix 'passing zero to ERR_PTR()' warning
aio: fix failure to put the file pointer
sysv: return 'err' instead of 0 in __sysv_write_inode
Use d_instantiate() rather than d_add() and don't d_drop() in
afs_vnode_new_inode(). The dentry shouldn't be removed as it's not
changing its name.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kAFS can be given certain network errors (EADDRNOTAVAIL, EHOSTDOWN and
ERFKILL) that it doesn't handle in its server/address rotation algorithms.
They cause the probing and rotation to abort immediately rather than
rotating.
Fix this by:
(1) Abstracting out the error prioritisation from the VL and FS rotation
algorithms into a common function and expand usage into the server
probing code.
When multiple errors are available, this code selects the one we'd
prefer to return.
(2) Add handling for EADDRNOTAVAIL, EHOSTDOWN and ERFKILL.
Fixes: 0fafdc9f88 ("afs: Fix file locking")
Fixes: 0338747d8454 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When afs_validate() is called to validate a vnode (inode), there are two
unhandled cases in the fastpath at the top of the function:
(1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
counters match and the data has expired, then there's an implicit case
in which the vnode needs revalidating.
This has no consequences since the default "valid = false" set at the
top of the function happens to do the right thing.
(2) If the vnode is not promised and it hasn't been deleted
(AFS_VNODE_DELETED is not set) then there's a default case we're not
handling in which the vnode is invalid. If the vnode is invalid, we
need to bring cb_s_break and cb_v_break up to date before we refetch
the status.
As a consequence, once the server loses track of the client
(ie. sufficient time has passed since we last sent it an operation),
it will send us a CB.InitCallBackState* operation when we next try to
talk to it. This calls afs_init_callback_state() which increments
afs_server::cb_s_break, but this then doesn't propagate to the
afs_vnode record.
The result being that every afs_validate() call thereafter sends a
status fetch operation to the server.
Clarify and fix this by:
(A) Setting valid in all the branches rather than initialising it at the
top so that the compiler catches where we've missed.
(B) Restructuring the logic in the 'promised' branch so that we set valid
to false if the callback is due to expire (or has expired) and so that
the final case is that the vnode is still valid.
(C) Adding an else-statement that ups cb_s_break and cb_v_break if the
promised and deleted cases don't match.
Fixes: c435ee3455 ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The actual number of bytes stored in a PRZ is smaller than the
bytes requested by platform data, since there is a header on each
PRZ. Additionally, if ECC is enabled, there are trailing bytes used
as well. Normally this mismatch doesn't matter since PRZs are circular
buffers and the leading "overflow" bytes are just thrown away. However, in
the case of a compressed record, this rather badly corrupts the results.
This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
Any stored crashes would not be uncompressable (producing a pstorefs
"dmesg-*.enc.z" file), and triggering errors at boot:
[ 2.790759] pstore: crypto_comp_decompress failed, ret = -22!
Backporting this depends on commit 70ad35db33 ("pstore: Convert console
write to use ->write_buf")
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Fixes: b0aad7a99c ("pstore: Add compression support to pstore")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlv/mEEACgkQnJ2qBz9k
QNm0kggA8SckBccdUZqpjxohASx9crp+jJeor2TRYCwyVfqdbDjkCOfv9k7+ddD4
D1qN/AEudQXPzr/DrjI0W9fnFG957FLo60RQ0aGwsq/3wfmzfMJ0qXIw2tyoqjIH
VxFecL2f3TKMr5zU9N1QoxJVAMAo5LOGbXO+qNYgTaOJ0EBpw9kxK14ib9JOi+pQ
CItZkAv4eOWMsA/AaRsWN7AGmsziwcMdoAZYRT2wiWdTmubYn66Negnffq2jVHjP
/kYlFjNcgpsuTg1AiTM+JlBwctEeLm37PUyimip3cdYwu6HcfNmCHDjEIzN2YphJ
twzAauTEr0bjNFiNWYraUPVHEcspNw==
=kKDs
-----END PGP SIGNATURE-----
Merge tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and udf fixes from Jan Kara:
"Three small ext2 and udf fixes"
* tag 'fixes_for_v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix potential use after free
ext2: initialize opts.s_mount_opt as zero before using it
udf: Allow mounting volumes with incorrect identification strings
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlv9qYgACgkQxWXV+ddt
WDupPQ/8DdeLZYQG1tlx2Q+X4/tqPVyAzUjguYzbIY7wvSs1zbEEedENsD8E97yC
So8ooGnP5B6/dqVidLFQBPwTXN59GybYbrDci8qh0DOJTl3+1r8byD9JC+iofrOF
tltJkZ+eCOQyyqHHzlzw15uNOg48Qzj1oXvTAcE0P6iN5UcvcfwRW/S39pjsn63C
63zc09XJ1hmJMJTWZo5h3GoD2UvzrwGXPKXNdv/NWkw9sqQbWdjvZFdqKbvY1VeM
Oa6FPAPErJqEEEePhpDYbyRcnzjJRMs0deLGpGGChGldQxgMO8ILzBwh/KalfzK7
h7LIuv1EclUqlyv0mXPqg2E/C3n2UMPqQYFsK9Lt+4Y/PkrWA2jx0lSg0fBl3k8c
7PyiTqPNPNF8LU48tPEnOzJuNPkquOycgdyQOUpHnS43OF5OLIb6tVyjK4eJHRWw
xtP65M72qM8T65+gsxYcdm0lvIDLidIwFS+2g4ibKU7EwlYkTC9AHFIAyFKTgxeP
MpkIH90mKhSxOpbq8RICgr2jWcJZYoFQ4soi1oE+bgyjv75PyhJ0eXOprCh/4KZp
nkXlPy2skkO9gGecyvr51x/opDEjEkObyOjQm2LhhWYvgcnHgW8Zp1jhQKxabHvz
iZdVIs/agOerpk1d9ZBHhIXOeS2UcE5klqVRAdf961Wobh+HNis=
=cCvI
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Some of these bugs are being hit during testing so we'd like to get
them merged, otherwise there are usual stability fixes for stable
trees"
* tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: relocation: set trans to be NULL after ending transaction
Btrfs: fix race between enabling quotas and subvolume creation
Btrfs: send, fix infinite loop due to directory rename dependencies
Btrfs: ensure path name is null terminated at btrfs_control_ioctl
Btrfs: fix rare chances for data loss when doing a fast fsync
btrfs: Always try all copies when reading extent buffers
The function ext2_xattr_set calls brelse(bh) to drop the reference count
of bh. After that, bh may be freed. However, following brelse(bh),
it reads bh->b_data via macro HDR(bh). This may result in a
use-after-free bug. This patch moves brelse(bh) after reading field.
CC: stable@vger.kernel.org
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We need to initialize opts.s_mount_opt as zero before using it, else we
may get some unexpected mount options.
Fixes: 088519572c ("ext2: Parse mount options into a dedicated structure")
CC: stable@vger.kernel.org
Signed-off-by: xingaopeng <xingaopeng@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Highlights include:
Bugfixes:
- Fix a NFSv4 state manager deadlock when returning a delegation
- NFSv4.2 copy do not allocate memory under the lock
- flexfiles: Use the correct stateid for IO in the tightly coupled case
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb+hCNAAoJEA4mA3inWBJc8ZQP/jR+uemJycwgyWINvnn6PEtE
hyiSwL+c3jhBHwX2IroF1KvaHIa8GXMbIWj+DfW1iB2htYnIJYz4IFJOGpfN1S7n
bKCgonV0V06+dFF4DqcL3HcM1L6bo26n16voi3otgY0R5U5HGwB1tocZPCbR6DpK
meiRbrmB6O962zluUlTuu9zFSvsALyZR0h4tYMGYA0MlgWQJVLH6+dufyG2Zgp2Z
OH9tUzRFknD/Q4KrJv7zrMY198mHa+RQovsO2Jc/iE4bbrSMyVNtrPuVJphsP1BD
lZ5SvvWLXjNepUMsDCK+Es7i6dUmtHsGPS6gNDwUwY9/UlwOPYlp44VJzmEYmQcz
/VrrHn3LSoKDSAVNrksghto9O4T1NPnuVja1Q+SHf5hVX5OlsxyDkvX23ZUdgdkW
BeXeNWZuAJdDTI1KU+ahm2ilfUnuFpRGRHUrH2sYczV2okC38cO5YCIRI3Tckz6e
jrhmJcw+zCWv3Yl3h2Rbf8fVRcWJHA+qLWT3Str5nCyZiqPCag7Z7br9r5316zDv
Yma7nITZO7HH1cZUv+byA0PVHU96kDsMhhpxYISrSr4lf2BcZNnjQC/0IHb7qdWz
FgpYzv/BsIi+KxyZKshiR5E60kOmVxv2wIhre8uLOuuabcGsh/wit6URVnQJ+GDp
7klRY1t1P24XaIbgBR9U
=hqbe
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix a NFSv4 state manager deadlock when returning a delegation
- NFSv4.2 copy do not allocate memory under the lock
- flexfiles: Use the correct stateid for IO in the tightly coupled case
* tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
flexfiles: use per-mirror specified stateid for IO
NFSv4.2 copy do not allocate memory under the lock
NFSv4: Fix a NFSv4 state manager deadlock
We found some bugs in the DAX conversion to XArray (and one bug which
predated the XArray conversion). There were a couple of bugs in some of
the higher-level functions, which aren't actually being called in today's
kernel, but surfaced as a result of converting existing radix tree &
IDR users over to the XArray. Some of the other changes to how the
higher-level APIs work were also motivated by converting various users;
again, they're not in use in today's kernel, so changing them has a low
probability of introducing a bug.
Dan can still trigger a bug in the DAX code with hot-offline/online,
and we're working on tracking that down.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlv542AUHHdpbGx5QGlu
ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5BoAf/QZzbBcYuYMLMDYofvHKGlmk2yx/a
ObUlxlQtXGHvPp3oC3rdwAvcN/KAMDpU0u+PXab2MnrNw5okhpS6ZwGODlkarNA4
XbVQNGbtEbACr1V3CWc0NzLbYm6JtGpMum0Wx9MVR/VdTnGArBLBYQMYa/c1YhKA
vEBPf+w0j0QoCTAgPiIvq0aksuBQERUvjhlUvoaMY7F4sAhnaW558lvaEcc1xGxq
70+3cRPT6Uh12tEvi0LKP1NNEXebvQSftMvFEUPF2xo5z2v//KEobzv/anbojxQ8
BtxouIGSr4tME9g3xSpd9rTbUcW3bwDAhuWZvpP/ViRwW2UkEQonpApdaw==
=0Ert
-----END PGP SIGNATURE-----
Merge tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax
Pull XArray updates from Matthew Wilcox:
"We found some bugs in the DAX conversion to XArray (and one bug which
predated the XArray conversion). There were a couple of bugs in some
of the higher-level functions, which aren't actually being called in
today's kernel, but surfaced as a result of converting existing radix
tree & IDR users over to the XArray.
Some of the other changes to how the higher-level APIs work were also
motivated by converting various users; again, they're not in use in
today's kernel, so changing them has a low probability of introducing
a bug.
Dan can still trigger a bug in the DAX code with hot-offline/online,
and we're working on tracking that down"
* tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax:
XArray tests: Add missing locking
dax: Avoid losing wakeup in dax_lock_mapping_entry
dax: Fix huge page faults
dax: Fix dax_unlock_mapping_entry for PMD pages
dax: Reinstate RCU protection of inode
dax: Make sure the unlocking entry isn't locked
dax: Remove optimisation from dax_lock_mapping_entry
XArray tests: Correct some 64-bit assumptions
XArray: Correct xa_store_range
XArray: Fix Documentation
XArray: Handle NULL pointers differently for allocation
XArray: Unify xa_store and __xa_store
XArray: Add xa_store_bh() and xa_store_irq()
XArray: Turn xa_erase into an exported function
XArray: Unify xa_cmpxchg and __xa_cmpxchg
XArray: Regularise xa_reserve
nilfs2: Use xa_erase_irq
XArray: Export __xa_foo to non-GPL modules
XArray: Fix xa_for_each with a single element at 0
- Numerous corruption fixes for copy on write
- Numerous corruption fixes for blocksize < pagesize writes
- Don't miscalculate AG reservations for small final AGs
- Fix page cache truncation to work properly for reflink and extent
shifting
- Fix use-after-free when retrying failed inode/dquot buffer logging
- Fix corruptions seen when using copy_file_range in directio mode
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlv1oroACgkQ+H93GTRK
tOu+1RAAnteFwaq3WLDYmSrMia/4m52sxvatxlCCSjCGdvfvuZozTwbTdB6FFfuc
Ql6Z6F2Lx1sHDNJvwBCsO8qPB0qOhjSnBI/wPe2kz/NETGYNp88vHX7OZvkPVONl
jDaCWTcu0BWNiOGi17uTY8sBa8u1izbo5F+pEQIyUjoCgUc9JB2di9dVnUJ0byrh
wZjrmPD95ojqOozqppXfFQ0QIbozpTXR3kyU9S0EhHmbnWJZ9t08Iuhd2LjOoDB4
cUFG/1qDXuFvALyM8m1mA7xSBZpA/glFgNeAtmz53aIOZ9Tl8w8cLJJBRx5AqUDU
bpBU1y08Bm3OIw+uiTMkiPkCQRMDgtiHKlPxuiKqlsNY0KqYgwWlWcbU/OTvHly8
In+CnbEBqLejKyEIz3nEQ4YimfvHbAlC/3V+nx2qO45hvTXA5lEIGAbBmiLW0ni8
6eBXGeIjKAw0zYOoXC4OuKIiHlQh7AHJB25i9xJTzknRI9jqwZFGkxgdl33Vrq8W
gTnfgOhMX2dGmcPrgMgtu+aiBwKf+GJv94/2EJwligExnWXQSsQmGCwKl7ysoE1g
iU/MhJT5IYYP/TDqldkahUPSwD2FN4UFtzNfpeDX3H6kxM1R41l+aerdu64UPNji
G98U+cWyyUmbu9ziLyREM/XyWz4UhNAz7lRId3ryeu8GPUm2AoY=
=TiLQ
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Dave and I have continued our work fixing corruption problems that can
be found when running long-term burn-in exercisers on xfs. Here are
some patches fixing most of the problems, but there will likely be
more. :/
- Numerous corruption fixes for copy on write
- Numerous corruption fixes for blocksize < pagesize writes
- Don't miscalculate AG reservations for small final AGs
- Fix page cache truncation to work properly for reflink and extent
shifting
- Fix use-after-free when retrying failed inode/dquot buffer logging
- Fix corruptions seen when using copy_file_range in directio mode"
* tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: readpages doesn't zero page tail beyond EOF
vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
iomap: dio data corruption and spurious errors when pipes fill
iomap: sub-block dio needs to zeroout beyond EOF
iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
xfs: delalloc -> unwritten COW fork allocation can go wrong
xfs: flush removing page cache in xfs_reflink_remap_prep
xfs: extent shifting doesn't fully invalidate page cache
xfs: finobt AG reserves don't consider last AG can be a runt
xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
xfs: uncached buffer tracing needs to print bno
xfs: make xfs_file_remap_range() static
xfs: fix shared extent data corruption due to missing cow reservation
- Fix tasks freezer deadlock in de_thread() that occurs if one
of its sub-threads has been frozen already (Chanho Min).
- Avoid registering a platform device by the ti-cpufreq driver
on platforms that cannot use it (Dave Gerlach).
- Fix a mistake in the ti-opp-supply operating performance points
(OPP) driver that caused an incorrect reference voltage to be
used and make it adjust the minimum voltage dynamically to avoid
hangs or crashes in some cases (Keerthy).
- Fix issues related to compiler flags in the cpupower utility
and correct a linking problem in it by renaming a file with
a duplicate name (Jiri Olsa, Konstantin Khlebnikov).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJb99ANAAoJEILEb/54YlRxUk4QAIkpXmLdEm0zvQXJZKJxuxu/
8eiglPcoCAAVgR3uRjezTyQeFCDjqGIqmpiipFOkxgAaeb+aiFta2B88B4oZ3OAf
G5yJG4CMs5+Tp/oH2+McX4PMo2WfoEIDoOGVOdtlk4iGAIh9tT5KvNfCUyxHLP7c
cFjdeUQuc8PRPzutESxYHB23FmECCEprVmEFVckQ7vSCnWX6dm1mFteCiNZADH8/
YNgRWRYtyWXlPRNCPJuCvYmVHLgm0Tw6CVhG4ttXT9wIkYyiLK9Flx9X6gsBWdoS
ej1jbJM7R/EAUzHvCjUCSfMKDNWvQySR3gC2m0vG5Goext2UXWieSHaI6BfqjPP2
wTBX9lAKu3iIGISoorjJZAMe4v26Tpbs0v5ApsOGr50WNZKZtRkBwaJ0EFT2hEa8
MsDtUjr+d7CIQ+ZDHmAbOzghpDlSuhGypfkD7IPtA/AY1dy1wJ3BUYya8ucqSdr3
iKkQcndN3ASR1+Fyb3c1cflh9fWe84aeSmYPihcX2m+/D43keYXbj2ANdCH1dF3E
cwAXw9+VteUQ1tihqgJapFogK3VphMbRErPSGXryuyiMxr38g7tgHAd2rId0JG4g
QftqpZa19aEed8tJNaQWWjBaBFgYeAT9BQqcP6CjFOO9klPH//NSiqzQUFMdzsEc
Qql1V/aoOIsflb7gJH3/
=JQxr
-----END PGP SIGNATURE-----
Merge tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two issues in the Operating Performance Points (OPP)
framework, one cpufreq driver issue, one problem related to the tasks
freezer and a few build-related issues in the cpupower utility.
Specifics:
- Fix tasks freezer deadlock in de_thread() that occurs if one of its
sub-threads has been frozen already (Chanho Min).
- Avoid registering a platform device by the ti-cpufreq driver on
platforms that cannot use it (Dave Gerlach).
- Fix a mistake in the ti-opp-supply operating performance points
(OPP) driver that caused an incorrect reference voltage to be used
and make it adjust the minimum voltage dynamically to avoid hangs
or crashes in some cases (Keerthy).
- Fix issues related to compiler flags in the cpupower utility and
correct a linking problem in it by renaming a file with a duplicate
name (Jiri Olsa, Konstantin Khlebnikov)"
* tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
exec: make de_thread() freezable
cpufreq: ti-cpufreq: Only register platform_device when supported
opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call
opp: ti-opp-supply: Dynamically update u_volt_min
tools cpupower: Override CFLAGS assignments
tools cpupower debug: Allow to use outside build flags
tools/power/cpupower: fix compilation with STATIC=true
The function dentry_connected calls dput(dentry) to drop the previously
acquired reference to dentry. In this case, dentry can be released.
After that, IS_ROOT(dentry) checks the condition
(dentry == dentry->d_parent), which may result in a use-after-free bug.
This patch directly compares dentry with its parent obtained before
dropping the reference.
Fixes: a056cc8934c("exportfs: stop retrying once we race with
rename/remove")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.
Fixes: 0647bf564f ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rfc8435 says:
For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file.
However current implementation replaces per-mirror provided stateid with
by open or lock stateid.
Ensure that per-mirror stateid is used by ff_layout_write_prepare_v4 and
nfs4_ff_layout_prepare_ds.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Bruce pointed out that we shouldn't allocate memory while holding
a lock in the nfs4_callback_offload() and handle_async_copy()
that deal with a racing CB_OFFLOAD and reply to COPY case.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We have a race between enabling quotas end subvolume creation that cause
subvolume creation to fail with -EINVAL, and the following diagram shows
how it happens:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_ioctl()
create_subvol()
btrfs_qgroup_inherit()
-> save fs_info->quota_root
into quota_root
-> stores a NULL value
-> tries to lock the mutex
qgroup_ioctl_lock
-> blocks waiting for
the task at CPU0
-> sets BTRFS_FS_QUOTA_ENABLED in fs_info
-> sets quota_root in fs_info->quota_root
(non-NULL value)
mutex_unlock(fs_info->qgroup_ioctl_lock)
-> checks quota enabled
flag is set
-> returns -EINVAL because
fs_info->quota_root was
NULL before it acquired
the mutex
qgroup_ioctl_lock
-> ioctl returns -EINVAL
Returning -EINVAL to user space will be confusing if all the arguments
passed to the subvolume creation ioctl were valid.
Fix it by grabbing the value from fs_info->quota_root after acquiring
the mutex.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we read the EOF page of the file via readpages, we need
to zero the region beyond EOF that we either do not read or
should not contain data so that mmap does not expose stale data to
user applications.
However, iomap_adjust_read_range() fails to detect EOF correctly,
and so fsx on 1k block size filesystems fails very quickly with
mapreads exposing data beyond EOF. There are two problems here.
Firstly, when calculating the end block of the EOF byte, we have
to round the size by one to avoid a block aligned EOF from reporting
a block too large. i.e. a size of 1024 bytes is 1 block, which in
index terms is block 0. Therefore we have to calculate the end block
from (isize - 1), not isize.
The second bug is determining if the current page spans EOF, and so
whether we need split it into two half, one for the IO, and the
other for zeroing. Unfortunately, the code that checks whether
we should split the block doesn't actually check if we span EOF, it
just checks if the read spans the /offset in the page/ that EOF
sits on. So it splits every read into two if EOF is not page
aligned, regardless of whether we are reading the EOF block or not.
Hence we need to restrict the "does the read span EOF" check to
just the page that spans EOF, not every page we read.
This patch results in correct EOF detection through readpages:
xfs_vm_readpages: dev 259:0 ino 0x43 nr_pages 24
xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
As you can see, it now does full page reads until the last one which
is split correctly at the block aligned EOF, reading 3072 bytes and
zeroing the last 1024 bytes. The original version of the patch got
this right, but it got another case wrong.
The EOF detection crossing really needs to the the original length
as plen, while it starts at the end of the block, will be shortened
as up-to-date blocks are found on the page. This means "orig_pos +
plen" no longer points to the end of the page, and so will not
correctly detect EOF crossing. Hence we have to use the length
passed in to detect this partial page case:
xfs_filemap_fault: dev 259:1 ino 0x43 write_fault 0
xfs_vm_readpage: dev 259:1 ino 0x43 nr_pages 1
xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
Heere we see a trace where the first block on the EOF page is up to
date, hence poff = 1024 bytes. The offset into the page of EOF is
3072, so the range we want to read is 1024 - 3071, and the range we
want to zero is 3072 - 4095. You can see this is split correctly
now.
This fixes the stale data beyond EOF problem that fsx quickly
uncovers on 1k block size filesystems.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().
Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing direct IO to a pipe for do_splice_direct(), then pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.
The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.
copy_file_write() then completely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.
Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.
To make matters worse, iomap_dio_actor() has a stale data exposure
bug bio_iov_iter_get_pages() fails - it does not zero the tail block
that it may have been left uncovered by partial IO. Fix the error
handling case to drop to the sub-block zeroing rather than
immmediately returning the -EFAULT error.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we are doing sub-block dio that extends EOF, we need to zero
the unused tail of the block to initialise the data in it it. If we
do not zero the tail of the block, then an immediate mmap read of
the EOF block will expose stale data beyond EOF to userspace. Found
with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
Fix this by detecting if the end of the DIO write is beyond EOF
and zeroing the tail if necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we write into an unwritten extent via direct IO, we dirty
metadata on IO completion to convert the unwritten extent to
written. However, when we do the FUA optimisation checks, the inode
may be clean and so we issue a FUA write into the unwritten extent.
This means we then bypass the generic_write_sync() call after
unwritten extent conversion has ben done and we don't force the
modified metadata to stable storage.
This violates O_DSYNC semantics. The window of exposure is a single
IO, as the next DIO write will see the inode has dirty metadata and
hence will not use the FUA optimisation. Calling
generic_write_sync() after completion of the second IO will also
sync the first write and it's metadata.
Fix this by avoiding the FUA optimisation when writing to unwritten
extents.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Long saga. There have been days spent following this through dead end
after dead end in multi-GB event traces. This morning, after writing
a trace-cmd wrapper that enabled me to be more selective about XFS
trace points, I discovered that I could get just enough essential
tracepoints enabled that there was a 50:50 chance the fsx config
would fail at ~115k ops. If it didn't fail at op 115547, I stopped
fsx at op 115548 anyway.
That gave me two traces - one where the problem manifested, and one
where it didn't. After refining the traces to have the necessary
information, I found that in the failing case there was a real
extent in the COW fork compared to an unwritten extent in the
working case.
Walking back through the two traces to the point where the CWO fork
extents actually diverged, I found that the bad case had an extra
unwritten extent in it. This is likely because the bug it led me to
had triggered multiple times in those 115k ops, leaving stray
COW extents around. What I saw was a COW delalloc conversion to an
unwritten extent (as they should always be through
xfs_iomap_write_allocate()) resulted in a /written extent/:
xfs_writepage: dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
xfs_iext_remove: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
xfs_bmap_pre_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
Basically, Cow fork before:
0 1 32 52
+H+DDDDDDDDDDDD+UUUUUUUUUUU+
PREV RIGHT
COW delalloc conversion allocates:
1 32
+uuuuuuuuuuuu+
NEW
And the result according to the xfs_bmap_post_update trace was:
0 1 32 52
+H+wwwwwwwwwwwwwwwwwwwwwwww+
PREV
Which is clearly wrong - it should be a merged unwritten extent,
not an unwritten extent.
That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
the bug.
It takes the old delalloc extent (PREV) and adds the length of the
RIGHT extent to it, takes the start block from NEW, removes the
RIGHT extent and then updates PREV with the new extent.
What it fails to do is update PREV.br_state. For delalloc, this is
always XFS_EXT_NORM, while in this case we are converting the
delayed allocation to unwritten, so it needs to be updated to
XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
the resultant extent is always written.
And that's the bug I've been chasing for a week - a bmap btree bug,
not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
with the recent in core extent tree scalability enhancements.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On a sub-page block size filesystem, fsx is failing with a data
corruption after a series of operations involving copying a file
with the destination offset beyond EOF of the destination of the file:
8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW
8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes)
8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400
8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE
8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00
The second copy here is beyond EOF, and it is to sub-page (4k) but
block aligned (1k) offset. The clone runs the EOF zeroing, landing
in a pre-existing post-eof delalloc extent. This zeroes the post-eof
extents in the page cache just fine, dirtying the pages correctly.
The problem is that xfs_reflink_remap_prep() now truncates the page
cache over the range that it is copying it to, and rounds that down
to cover the entire start page. This removes the dirty page over the
delalloc extent from the page cache without having written it back.
Hence later, when the page cache is flushed, the page at offset
0x6f000 has not been written back and hence exposes stale data,
which fsx trips over less than 10 operations later.
Fix this by changing xfs_reflink_remap_prep() to use
xfs_flush_unmap_range().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent shifting code uses a flush and invalidate mechainsm prior
to shifting extents around. This is similar to what
xfs_free_file_space() does, but it doesn't take into account things
like page cache vs block size differences, and it will fail if there
is a page that it currently busy.
xfs_flush_unmap_range() handles all of these cases, so just convert
xfs_prepare_shift() to us that mechanism rather than having it's own
special sauce.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The last AG may be very small comapred to all other AGs, and hence
AG reservations based on the superblock AG size may actually consume
more space than the AG actually has. This results on assert failures
like:
XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
[ 48.932891] xfs_ag_resv_init+0x1bd/0x1d0
[ 48.933853] xfs_fs_reserve_ag_blocks+0x37/0xb0
[ 48.934939] xfs_mountfs+0x5b3/0x920
[ 48.935804] xfs_fs_fill_super+0x462/0x640
[ 48.936784] ? xfs_test_remount_options+0x60/0x60
[ 48.937908] mount_bdev+0x178/0x1b0
[ 48.938751] mount_fs+0x36/0x170
[ 48.939533] vfs_kern_mount.part.43+0x54/0x130
[ 48.940596] do_mount+0x20e/0xcb0
[ 48.941396] ? memdup_user+0x3e/0x70
[ 48.942249] ksys_mount+0xba/0xd0
[ 48.943046] __x64_sys_mount+0x21/0x30
[ 48.943953] do_syscall_64+0x54/0x170
[ 48.944835] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Hence we need to ensure the finobt per-ag space reservations take
into account the size of the last AG rather than treat it like all
the other full size AGs.
Note that both refcountbt and rmapbt already take the size of the AG
into account via reading the AGF length directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When retrying a failed inode or dquot buffer,
xfs_buf_resubmit_failed_buffers() clears all the failed flags from
the inde/dquot log items. In doing so, it also drops all the
reference counts on the buffer that the failed log items hold. This
means it can drop all the active references on the buffer and hence
free the buffer before it queues it for write again.
Putting the buffer on the delwri queue takes a reference to the
buffer (so that it hangs around until it has been written and
completed), but this goes bang if the buffer has already been freed.
Hence we need to add the buffer to the delwri queue before we remove
the failed flags from the log items attached to the buffer to ensure
it always remains referenced during the resubmit process.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fix a deadlock whereby the NFSv4 state manager can get stuck in the
delegation return code, waiting for a layout return to complete in
another thread. If the server reboots before that other thread
completes, then we need to be able to start a second state
manager thread in order to perform recovery.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
static.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless if the
underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.
fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation. This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.
The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.
For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.
This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.
Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull networking fixes from David Miller:
1) Fix some potentially uninitialized variables and use-after-free in
kvaser_usb can drier, from Jimmy Assarsson.
2) Fix leaks in qed driver, from Denis Bolotin.
3) Socket leak in l2tp, from Xin Long.
4) RSS context allocation fix in bnxt_en from Michael Chan.
5) Fix cxgb4 build errors, from Ganesh Goudar.
6) Route leaks in ipv6 when removing exceptions, from Xin Long.
7) Memory leak in IDR allocation handling of act_pedit, from Davide
Caratti.
8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.
9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
Sabrina Dubroca.
10) When NAPI cached skb is reused, we must set it to the proper initial
state which includes skb->pkt_type. From Eric Dumazet.
11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.
12) Set RX queue properly in various tuntap receive paths, from Matthew
Cover.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tuntap: fix multiqueue rx
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
tipc: don't assume linear buffer when reading ancillary data
tipc: fix lockdep warning when reinitilaizing sockets
net-gro: reset skb->pkt_type in napi_reuse_skb()
tc-testing: tdc.py: Guard against lack of returncode in executed command
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
ip_tunnel: don't force DF when MTU is locked
MAINTAINERS: Add entry for CAKE qdisc
net: bridge: fix vlan stats use-after-free on destruction
socket: do a generic_file_splice_read when proto_ops has no splice_read
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
net/sched: act_pedit: fix memory leak when IDR allocation fails
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
ipv6: fix a dst leak when removing its exception
net: mvneta: Don't advertise 2.5G modes
drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
net/mlx4: Fix UBSAN warning of signed integer overflow
...
After calling get_unlocked_entry(), you have to call
put_unlocked_entry() to avoid subsequent waiters losing wakeups.
Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Suspend fails due to the exec family of functions blocking the freezer.
The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
all sub-threads to die, and we have the deadlock if one of them is frozen.
This also can occur with the schedule() waiting for the group thread leader
to exit if it is frozen.
In our machine, it causes freeze timeout as bellows.
Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
Call trace:
[<ffffffc00008ed70>] __switch_to+0x88/0xa0
[<ffffffc000d1c30c>] __schedule+0x1bc/0x720
[<ffffffc000d1ca90>] schedule+0x40/0xa8
[<ffffffc0001cd784>] flush_old_exec+0xdc/0x640
[<ffffffc000220360>] load_elf_binary+0x2a8/0x1090
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc00021c584>] load_script+0x20c/0x228
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc0001ce8e0>] do_execveat_common.isra.14+0x4f8/0x6e8
[<ffffffc0001cedd0>] compat_SyS_execve+0x38/0x48
[<ffffffc00008de30>] el0_svc_naked+0x24/0x28
To fix this, make de_thread() freezable. It looks safe and works fine.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Chanho Min <chanho.min@lge.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit c26f6c6157 ("udf: Fix conversion of 'dstring' fields to UTF8")
started to be more strict when checking whether converted strings are
properly formatted. Sudip reports that there are DVDs where the volume
identification string is actually too long - UDF reports:
[ 632.309320] UDF-fs: incorrect dstring lengths (32/32)
during mount and fails the mount. This is mostly harmless failure as we
don't need volume identification (and even less volume set
identification) for anything. So just truncate the volume identification
string if it is too long and replace it with 'Invalid' if we just cannot
convert it for other reasons. This keeps slightly incorrect media still
mountable.
CC: stable@vger.kernel.org
Fixes: c26f6c6157 ("udf: Fix conversion of 'dstring' fields to UTF8")
Reported-and-tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix a static code checker warning:
fs/exportfs/expfs.c:171 reconnect_one() warn: passing zero to 'ERR_PTR'
The error path for lookup_one_len_unlocked failure
should set err to PTR_ERR.
Fixes: bbf7a8a356 ("exportfs: move most of reconnect_path to helper function")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spock reported that commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using xas_load() with a PMD-sized xa_state would work if either a
PMD-sized entry was present or a PTE sized entry was present in the
first 64 entries (of the 512 PTEs in a PMD on x86). If there was no
PTE in the first 64 entries, grab_mapping_entry() would believe there
were no entries present, allocate a PMD-sized entry and overwrite the
PTE in the page cache.
Use xas_find_conflict() instead which turns out to simplify
both get_unlocked_entry() and grab_mapping_entry(). Also remove a
WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered
in get_unlocked_entry().
Fixes: cfc93c6c6c ("dax: Convert dax_insert_pfn_mkwrite to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Device DAX PMD pages do not set the PageHead bit for compound pages.
Fix for now by retrieving the PMD bit from the entry, but eventually we
will be passed the page size by the caller.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
If the ioprio capability check fails, we return without putting
the file pointer.
Fixes: d9a08a9e61 ("fs: Add aio iopriority support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For the device-dax case, it is possible that the inode can go away
underneath us. The rcu_read_lock() was there to prevent it from
being freed, and not (as I thought) to protect the tree. Bring back
the rcu_read_lock() protection. Also add a little kernel-doc; while
this function is not exported to modules, it is used from outside dax.c
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
I wrote the semantics in the commit message, but didn't document it in
the source code. Use a BUG_ON instead (if any code does do this, it's
really buggy; we can't recover and it's worth taking the machine down).
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Skipping some of the revalidation after we sleep can lead to returning
a mapping which has already been freed. Just drop this optimisation.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvukbUACgkQnJ2qBz9k
QNkJnggAriAyO72fH7nfh2AzvwnkqLi9uE4JQk+ZxQFW9W6a32Y+qIRBsaiau++w
/lOVSGLCCVewzAa4b/lYPdyOS/yG4w6UuEsg5HQPux4JsOccalYudB5ONdeZGd/P
5FPxcsixDQQHP9mhz1gIn2wV7bbytH2DduIHxuKbJfX//xnO3jS/gnJq82rnxb4R
IzqewVjbFiqYQKtyuzIqKqI3yOVqESJhyQ96j6chj5YKNUkMuwU9akhQGR4Zcqen
/cxh+mK+ZzCybgu2Sfgx3noZb5RFYePpcSSpDG5OKJc04IY4m76fyHIi2flBE9AX
HCN60y+VyKFOiNxzsSLXgsE9aj40OA==
=UcJO
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fix from Jan Kara:
"One small fsnotify fix for duplicate events"
* tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: fix handling of events on child sub-directory
Pull bfs2 fixes from Andreas Gruenbacher:
"Fix two bugs leading to leaked buffer head references:
- gfs2: Put bitmap buffers in put_super
- gfs2: Fix iomap buffer head reference counting bug
And one bug leading to significant slow-downs when deleting large
files:
- gfs2: Fix metadata read-ahead during truncate (2)"
* tag 'gfs2-4.20.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix iomap buffer head reference counting bug
gfs2: Fix metadata read-ahead during truncate (2)
gfs2: Put bitmap buffers in put_super
GFS2 passes the inode buffer head (dibh) from gfs2_iomap_begin to
gfs2_iomap_end in iomap->private. It sets that private pointer in
gfs2_iomap_get. Users of gfs2_iomap_get other than gfs2_iomap_begin
would have to release iomap->private, but this isn't done correctly,
leading to a leak of buffer head references.
To fix this, move the code for setting iomap->private from
gfs2_iomap_get to gfs2_iomap_begin.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>