Combine all the boolean state flags in struct xfs_scrub into a single
unsigned int, because we're going to be adding more state flags soon.
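As a rough illustration of the idea only, several bool members can be folded into one bitmask so that new state needs a new bit rather than a new field. The flag names, the struct, and the helpers below are hypothetical stand-ins, not the actual struct xfs_scrub definitions.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real xfs_scrub flag names may differ. */
#define SCRUB_FLAG_TRY_HARDER   (1U << 0)
#define SCRUB_FLAG_HAS_QUOTAOFF (1U << 1)

/* Simplified stand-in for the scrub context. */
struct scrub_ctx {
	unsigned int flags;	/* replaces several separate bool members */
};

static inline void scrub_set_flag(struct scrub_ctx *sc, unsigned int flag)
{
	sc->flags |= flag;
}

static inline void scrub_clear_flag(struct scrub_ctx *sc, unsigned int flag)
{
	sc->flags &= ~flag;
}

static inline bool scrub_test_flag(const struct scrub_ctx *sc,
				   unsigned int flag)
{
	return (sc->flags & flag) != 0;
}
```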
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
It's a little silly how the memset in scrub context initialization
forces us to declare stack variables to preserve context variables
across a retry. Since the teardown functions already null out most of
the ephemeral state (buffer pointers, btree cursors, etc.), just skip
the memset and move the initialization as needed.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Use space in the bulkstat ioctl structure to report any problems
observed with the inode.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Use the AG geometry info ioctl to report health status too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Use our newly expanded geometry structure to report the overall fs and
realtime health status.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add a new ioctl to describe an allocation group's geometry.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we
can't just add a new field to it. Hence we need to bump the definition
to V5 and treat the V4 ioctl and structure similar to v1 to v3.
While doing this, clean up all the definitions associated with the
XFS_IOC_FSGEOMETRY ioctl.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: forward port to 5.1, expand structure size to 256 bytes]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If we know the filesystem metadata isn't healthy during unmount, we want
to encourage the administrator to run xfs_repair right away. We can't
do this if BAD_SUMMARY will cause an unclean log unmount to force
summary recalculation, so turn it off if the fs is bad.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Replace the BAD_SUMMARY mount flag with calls to the equivalent health
tracking code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add the necessary in-core metadata fields to keep track of which parts
of the filesystem have been observed and which parts were observed to be
unhealthy, and print a warning at unmount time if we have unfixed
problems.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This patch tries to address two problems:
1) Return the @minlen we actually used for trimming to user space.
2) Return EINVAL if the granularity is larger than avg size. In most
cases the granularity is small (4K), but if a device reports a
larger granularity for some reason (testing, bugs, etc.), fstrim
should fail directly.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The block allocation AG selection code has parameters that allow a
caller to perform multiple allocations from a single AG and
transaction (under certain conditions). The parameters specify the
total block allocation count required by the transaction and the AG
selection code selects and locks an AG that will be able to satisfy
the overall requirement. If the available block accounting
calculation turns out to be inaccurate and a subsequent allocation
call fails with -ENOSPC, the resulting transaction cancel leads to
filesystem shutdown because the transaction is dirty.
This exact problem can be reproduced with a highly parallel space
consumer and fsstress workload run long enough against a large
filesystem to hit -ENOSPC conditions. A bmbt block allocation
request made for inode extent to bmap format conversion after an
extent allocation is expected to be satisfied by the same AG and the
same transaction as the extent allocation. The bmbt block allocation
fails, however, because the block availability of the AG has changed
since the AG was selected (outside of the blocks used for the extent
itself).
The inconsistent block availability calculation is caused by the
deferred block freeing behavior of the AGFL. This immediately
removes extra blocks from the AGFL to free up AGFL slots, but rather
than immediately freeing such blocks as was done in the past, the
block free is deferred such that said blocks are not available for
allocation until the current transaction commits. The AG selection
logic currently considers all AGFL blocks as available and executes
shortly before any extra AGFL blocks are freed. This means the block
availability of the current AG can change before the first
allocation even occurs, but in practice a failure is more likely to
manifest via a subsequent allocation because extent allocation
usually has a contiguity requirement larger than a single block that
can't be satisfied from the AGFL.
In general, XFS prefers operational robustness to absolute
allocation efficiency. In other words, we prefer to return -ENOSPC
slightly earlier at the expense of not being able to allocate every
last block in an AG to avoid this kind of problem. As such, update
the AG block availability calculation to consider extra AGFL blocks
as unavailable since they are immediately removed following the
calculation and will not become available until the current
transaction commits.
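The adjusted calculation can be sketched as below. This is a simplified model under stated assumptions, not the XFS implementation: the function and every parameter name are illustrative, and the real check takes more inputs.

```c
#include <stdint.h>

/*
 * Simplified model of the per-AG availability check. All parameter names
 * are illustrative; they are not the real XFS structure fields.
 *
 * freeblks:    free blocks tracked by the free space btrees
 * flcount:     blocks currently sitting on the AGFL
 * min_free:    AGFL blocks the AG must keep in reserve
 * reservation: blocks set aside by per-AG reservations
 */
static int64_t ag_blocks_available(uint32_t freeblks, uint32_t flcount,
				   uint32_t min_free, uint32_t reservation)
{
	/*
	 * Count AGFL blocks only up to min_free. Any surplus is about to be
	 * removed and deferred-freed, so it cannot be allocated from within
	 * the current transaction.
	 */
	uint32_t agfl_usable = flcount < min_free ? flcount : min_free;

	return (int64_t)freeblks + agfl_usable - reservation - min_free;
}
```

With this model, raising flcount above min_free no longer increases the result, which is the conservative behaviour described above.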
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If xfs_iflush_cluster() fails due to corruption, the error path
issues a shutdown and simulates an I/O completion to release the
buffer. This code has a couple small problems. First, the shutdown
sequence can issue a synchronous log force, which is unsafe to do
with buffer locks held. Second, the simulated I/O completion does not
guarantee the buffer is async and thus is unlocked and released.
For example, if the last operation on the buffer was a read off disk
prior to the corruption event, XBF_ASYNC is not set and the buffer
is left locked and held upon return. This results in a memory leak
as shown by the following message on module unload:
BUG xfs_buf (...): Objects remaining in xfs_buf on __kmem_cache_shutdown()
Fix both of these problems by setting XBF_ASYNC on the buffer prior
to the simulated I/O error and performing the shutdown immediately
after ioend processing when the buffer has been released.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS shutdown deadlocks have been reproduced by fstest generic/475.
The deadlock signature involves log I/O completion running error
handling to abort logged items and waiting for an inode cluster
buffer lock in the buffer item unpin handler. The buffer lock is
held by xfsaild attempting to flush an inode. The buffer happens to
be pinned and so xfs_iflush() triggers an async log force to begin
work required to get it unpinned. The log force is blocked waiting
on the commit completion, which never occurs and thus leaves the
filesystem deadlocked.
The root problem is that aborted log I/O completion puts commit
completion behind callback completion, which is unexpected for async
log forces. Under normal running conditions, an async log force
returns to the caller once the CIL ctx has been formatted/submitted
and the commit completion event triggered at the tail end of
xlog_cil_push(). If the filesystem has shut down, however, we rely on
xlog_cil_committed() to trigger the completion event and it happens
to do so after running log item unpin callbacks. This makes it
unsafe to invoke an async log force from contexts that hold locks
that might also be required in log completion processing.
To address this problem, wake commit completion waiters before
aborting log items in the log I/O completion handler. This ensures
that an async log force will not deadlock on held locks if the
filesystem happens to shut down. Note that it is still unsafe to
issue a sync log force while holding such locks because a sync log
force explicitly waits on the force completion, which occurs after
log I/O completion processing.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer
is unlocked when either non-stale or aborted. This assert occurs
after the bli refcount has been dropped and the log item potentially
freed. The aborted check is thus a potential use after free. This
problem has been reproduced with KASAN enabled via generic/475.
Fix up xfs_buf_item_unlock() to query aborted state before the bli
reference is dropped to prevent a potential use after free.
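The general pattern is to sample any state needed later before dropping the reference that may free the object. A minimal user-space sketch follows; the struct and function names are hypothetical and only model the ordering, not the xfs_buf_log_item code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted log item. */
struct log_item {
	int refcount;
	bool aborted;
};

/* Drop one reference; free the item when the last reference goes away. */
static void item_put(struct log_item *item)
{
	if (--item->refcount == 0)
		free(item);
}

/* Returns the aborted state without touching the item after the put. */
static bool item_unlock(struct log_item *item)
{
	bool aborted = item->aborted;	/* sample before the put */

	item_put(item);
	/* 'item' may have been freed here, so only use the local copy. */
	return aborted;
}
```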
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge page ref overflow branch.
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
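A sketch of the "try to take a reference, report failure" shape that callers now have to handle. The types and the exact saturation check are assumptions made for illustration; the kernel's try_get_page() and pipe_buf_get() differ in detail.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with a signed reference count. */
struct fake_page {
	int refcount;
};

/*
 * Take a reference only if that is safe. A non-positive count means the
 * counter is already broken or has wrapped; INT_MAX means one more
 * increment would wrap it.
 */
static bool try_get_ref(struct fake_page *page)
{
	if (page->refcount <= 0 || page->refcount == INT_MAX)
		return false;
	page->refcount++;
	return true;
}
```

A caller that previously assumed success would now check the return value and back out of the operation when it is false.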
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Set of fixes that should go into this round. This pull is larger than
I'd like at this time, but there's really no specific reason for that.
Some are fixes for issues that went into this merge window, others are
not. Anyway, this contains:
- Hardware queue limiting for virtio-blk/scsi (Dongli)
- Multi-page bvec fixes for lightnvm pblk
- Multi-bio dio error fix (Jason)
- Remove the cache hint from the io_uring tool side, since we didn't
move forward with that (me)
- Make io_uring SETUP_SQPOLL root restricted (me)
- Fix leak of page in error handling for pc requests (Jérôme)
- Fix BFQ regression introduced in this merge window (Paolo)
- Fix break logic for bio segment iteration (Ming)
- Fix NVMe cancel request error handling (Ming)
- NVMe pull request with two fixes (Christoph):
- fix the initial CSN for nvme-fc (James)
- handle log page offsets properly in the target (Keith)"
* tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
block: fix the return errno for direct IO
nvmet: fix discover log page when offsets are used
nvme-fc: correct csn initialization and increments on error
block: do not leak memory in bio_copy_user_iov()
lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
nvme: cancel request synchronously
blk-mq: introduce blk_mq_complete_request_sync()
scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
virtio-blk: limit number of hw queues by nr_cpu_ids
block, bfq: fix use after free in bfq_bfqq_expire
io_uring: restrict IORING_SETUP_SQPOLL to root
tools/io_uring: remove IOCQE_FLAG_CACHEHIT
block: don't use for-inside-for in bio_for_each_segment_all
Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fix:
- Fix a deadlock in close() due to incorrect draining of RDMA queues
Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be
sleeping" as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped
working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return
values"
* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
NFSv4.1 fix incorrect return value in copy_file_range
xprtrdma: Fix helper that drains the transport
NFS: Fix handling of reply page vector
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
If the last bio returned is not dio->bio, the status of that bio will
not be assigned to dio->bio if it is an error. This makes the status of
the whole IO wrong.
ksoftirqd/21-117 [021] ..s. 4017.966090: 8,0 C N 4883648 [0]
<idle>-0 [018] ..s. 4017.970888: 8,0 C WS 4924800 + 1024 [0]
<idle>-0 [018] ..s. 4017.970909: 8,0 D WS 4935424 + 1024 [<idle>]
<idle>-0 [018] ..s. 4017.970924: 8,0 D WS 4936448 + 321 [<idle>]
ksoftirqd/21-117 [021] ..s. 4017.995033: 8,0 C R 4883648 + 336 [65475]
ksoftirqd/21-117 [021] d.s. 4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
ksoftirqd/21-117 [021] d.s. 4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0
We always have to assign bio->bi_status to dio->bio.bi_status because we
will only check dio->bio.bi_status when we return the whole IO to
the upper layer.
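The intended propagation can be modelled as follows. The struct and function names are simplified stand-ins for the dio and bio objects, and keeping the first recorded error is an assumption of this sketch.

```c
/* Hypothetical, stripped-down stand-ins for the bio and dio objects. */
struct fake_bio {
	int status;		/* 0 means success */
};

struct fake_dio {
	struct fake_bio bio;	/* embedded bio whose status is reported */
};

/*
 * Completion handler model: copy an error from whichever bio finished
 * into the embedded bio, keeping the first error seen.
 */
static void fake_bio_end_io(struct fake_dio *dio, const struct fake_bio *bio)
{
	if (bio->status && !dio->bio.status)
		dio->bio.status = bio->status;
}
```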
Fixes: 542ff7bf18 ("block: new direct I/O implementation")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix parsing of compression algorithm when set as an inode property,
this could end up with e.g. 'zst' or 'zli' in the value
- don't allow trim on a filesystem with unreplayed log, this could
cause data loss if there are pending updates to the block groups that
would not be subject to trim after replay
* tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prop: fix vanished compression property after failed set
btrfs: prop: fix zstd compression parameter validation
Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
According to the NFSv4.2 spec, if the input and output files are the
same file, the operation should fail with EINVAL. However, the Linux
copy_file_range() system call has no such restriction. Therefore,
in that case let's return EOPNOTSUPP and allow the VFS to fall back
to doing do_splice_direct(). Also, when copy_file_range is called
on an NFSv4.0 or 4.1 mount (i.e., a server that doesn't support
COPY functionality), we also need to return EOPNOTSUPP and
fall back to a regular copy.
Fixes xfstest generic/075, generic/091, generic/112, generic/263
for all NFSv4.x versions.
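The decision described above can be modelled in a few lines. The function, its parameters, and the use of inode numbers to detect "same file" are assumptions for the sake of the sketch, not the NFS client code.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the copy offload decision. Returns the number of bytes the
 * server-side copy would handle, or -EOPNOTSUPP to tell the caller to
 * fall back to a regular copy.
 */
static long copy_range_model(unsigned long ino_in, unsigned long ino_out,
			     bool server_has_copy, long count)
{
	/* Same file: the server-side COPY cannot be used, so fall back. */
	if (ino_in == ino_out)
		return -EOPNOTSUPP;

	/* NFSv4.0/4.1 style mount without COPY: fall back as well. */
	if (!server_has_copy)
		return -EOPNOTSUPP;

	return count;	/* pretend the server copied everything */
}
```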
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NFSv4 GETACL and FS_LOCATIONS requests stopped working in v5.1-rc.
These two need the extra padding to be added directly to the reply
length.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
syzbot is reporting uninitialized value at rpc_sockaddr2uaddr() [1]. This
is because syzbot is setting AF_INET6 to "struct sockaddr_in"->sin_family
(which is embedded into user-visible "struct nfs_mount_data" structure)
even though nfs23_validate_mount_data() cannot pass sizeof(struct sockaddr_in6)
bytes of AF_INET6 address to rpc_sockaddr2uaddr().
Since "struct nfs_mount_data" structure is user-visible, we can't change
"struct nfs_mount_data" to use "struct sockaddr_storage". Therefore,
assuming that everybody is using AF_INET family when passing address via
"struct nfs_mount_data"->addr, reject if its sin_family is not AF_INET.
[1] https://syzkaller.appspot.com/bug?id=599993614e7cbbf66bc2656a919ab2a95fb5d75c
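A minimal user-space sketch of the check described above. The helper name and the choice of -EINVAL as the error are assumptions; the kernel change lives in the mount data validation path.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Reject any address whose family is not AF_INET, because the container
 * is only large enough for a struct sockaddr_in.
 */
static int validate_mount_addr(const struct sockaddr_in *addr)
{
	if (addr->sin_family != AF_INET)
		return -EINVAL;
	return 0;
}
```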
Reported-by: syzbot <syzbot+047a11c361b872896a4f@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull misc fixes from Al Viro:
"A few regression fixes from this cycle"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: use kmem_cache_free() instead of kfree()
iov_iter: Fix build error without CONFIG_CRYPTO
aio: Fix an error code in __io_submit_one()
This option spawns a kernel side thread that will poll for submissions
(and completions, if IORING_SETUP_IOPOLL is set). As this allows a user
to potentially use more cycles outside of the normal hierarchy,
restrict the use of this feature to root.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-linus-20190407' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Fixups for the pf/pcd queue handling (YueHaibing)
- Revert of the three direct issue changes as they have been proven to
cause an issue with dm-mpath (Bart)
- Plug rq_count reset fix (Dongli)
- io_uring double free in fileset registration error handling (me)
- Make null_blk handle bad numa node passed in (John)
- BFQ ifdef fix (Konstantin)
- Flush queue leak fix (Shenghui)
- Plug trace fix (Yufen)
* tag 'for-linus-20190407' of git://git.kernel.dk/linux-block:
xsysace: Fix error handling in ace_setup
null_blk: prevent crash from bad home_node value
block: Revert v5.0 blk_mq_request_issue_directly() changes
paride/pcd: Fix potential NULL pointer dereference and mem leak
blk-mq: do not reset plug->rq_count before the list is sorted
paride/pf: Fix potential NULL pointer dereference
io_uring: fix double free in case of fileset regitration failure
blk-mq: add trace block plug and unplug for multiple queues
block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx
block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y
Commit 9c225f2655 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.
This caused a regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d0 ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such a regression for the particular
case of /proc/xen/xenbus.
The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.
However, even though the 2006 version of Linus's patch added f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655 -
does so regardless of whether a file is seekable or not.
See
https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
https://lwn.net/Articles/180387
https://lwn.net/Articles/180396
for historic context.
The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:
kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter
...
Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write cannot proceed until the
read is done, and the read is waiting for some, potentially external,
event for a potentially unbounded time -> deadlock.
Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):
drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:
See
https://github.com/libfuse/osspd
https://lwn.net/Articles/308445
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:
https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
Let's fix this regression. The plan is:
1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.
2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.
3. With semantic patch search and convert to stream_open all in-kernel
nonseekable_open users for which read and write actually do not
depend on ppos and where there is no other methods in file_operations
which assume @offset access.
4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
stream_open if that bit is present in filesystem open reply.
It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so making such a change would break a real user.
5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).
This will allow OSSPD and other FUSE filesystems that provide
stream-like files to be patched to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.
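The effect of step 2 can be modelled in user space roughly as follows. This is a sketch under stated assumptions: the bit values are made up (the kernel's FMODE_* constants differ), struct fake_file stands in for struct file, and the helper names are hypothetical.

```c
/* Illustrative bit values only; the kernel's FMODE_* constants differ. */
#define FMODE_LSEEK      (1U << 0)
#define FMODE_PREAD      (1U << 1)
#define FMODE_PWRITE     (1U << 2)
#define FMODE_ATOMIC_POS (1U << 3)
#define FMODE_STREAM     (1U << 4)

/* Stand-in for struct file, carrying only the mode bits. */
struct fake_file {
	unsigned int f_mode;
};

/*
 * Model of opening a stream-like file: drop every position-related
 * capability, including the f_pos locking bit, and mark it as a stream.
 */
static int stream_open_model(struct fake_file *filp)
{
	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE |
			  FMODE_ATOMIC_POS);
	filp->f_mode |= FMODE_STREAM;
	return 0;
}

/* read()/write() would take the f_pos lock only when this returns 1. */
static int needs_f_pos_lock(const struct fake_file *filp)
{
	return (filp->f_mode & FMODE_ATOMIC_POS) != 0;
}
```

In this model a file opened the stream way never reports that it needs the f_pos lock, so a blocked read cannot hold up a concurrent write.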
This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.
Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.
The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"14 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kernel/sysctl.c: fix out-of-bounds access when setting file-max
mm/util.c: fix strndup_user() comment
sh: fix multiple function definition build errors
MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM
MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM
mm: writeback: use exact memcg dirty counts
psi: clarify the units used in pressure files
mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
hugetlbfs: fix memory leak for resv_map
mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()
lib/lzo: fix bugs for very short or empty input
include/linux/bitrev.h: fix constant bitrev
kmemleak: powerpc: skip scanning holes in the .bss section
lib/string.c: implement a basic bcmp
When mknod is used to create a block special file in hugetlbfs, it will
allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
inode->i_mapping->private_data will point to the newly allocated resv_map.
However, when the device special file is opened, bd_acquire() will set
inode->i_mapping to bd_inode->i_mapping. Thus the pointer to the
allocated resv_map is lost and the structure is leaked.
Programs to reproduce:
mount -t hugetlbfs nodev hugetlbfs
mknod hugetlbfs/dev b 0 0
exec 30<> hugetlbfs/dev
umount hugetlbfs/
resv_map structures are only needed for inodes which can have associated
page allocations. To fix the leak, only allocate resv_map for those
inodes which could possibly be associated with page allocations.
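A minimal user-space sketch of that rule follows. It assumes that regular files and symlinks are the inode types that can have page allocations; the enum, struct, and function names are hypothetical stand-ins, not the hugetlbfs code.

```c
#include <stdlib.h>

/* Hypothetical inode kinds for the model. */
enum inode_kind { KIND_REGULAR, KIND_SYMLINK, KIND_DIRECTORY, KIND_BLOCK_DEV };

struct resv_map_model { int placeholder; };

struct inode_model {
	enum inode_kind kind;
	struct resv_map_model *resv_map; /* NULL when not needed */
};

/* Assumption: only regular files and symlinks can own page allocations. */
int inode_needs_resv_map(enum inode_kind kind)
{
	return kind == KIND_REGULAR || kind == KIND_SYMLINK;
}

/* Allocate the reservation map only for inodes that can use it, so a
 * device special file never carries a map that could be leaked. */
int inode_model_init(struct inode_model *inode, enum inode_kind kind)
{
	inode->kind = kind;
	inode->resv_map = NULL;
	if (inode_needs_resv_map(kind)) {
		inode->resv_map = calloc(1, sizeof(*inode->resv_map));
		if (!inode->resv_map)
			return -1;
	}
	return 0;
}
```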
Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Yufen Yu <yuyufen@huawei.com>
Suggested-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
implementation in x86 was horrible and gcc certainly gets it wrong. He
said that since the tracepoints only pass in 0 and 6 for i and n respectively,
it should be optimized for that case. Inspecting the kernel, I discovered
that all users pass in 0 for i and only one file passes in something other
than 6 for the number of arguments. That code happens to be my own code used
for the special syscall tracing. That can easily be converted to just
using 0 and 6 as well, and only copying what is needed. Which is probably
the faster path anyway for that case.
Along the way, a couple of real fixes came from this as the
syscall_get_arguments() function was incorrect for csky and riscv.
x86 has been optimized for the new interface that removes the variable
number of arguments, but the other architectures could still use some
loving and take more advantage of the simpler interface.
Merge tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull syscall-get-arguments cleanup and fixes from Steven Rostedt:
"Andy Lutomirski approached me to tell me that the
syscall_get_arguments() implementation in x86 was horrible and gcc
certainly gets it wrong.
He said that since the tracepoints only pass in 0 and 6 for i and n
respectively, it should be optimized for that case. Inspecting the
kernel, I discovered that all users pass in 0 for i and only one file
passes in something other than 6 for the number of arguments. That
code happens to be my own code used for the special syscall tracing.
That can easily be converted to just using 0 and 6 as well, and only
copying what is needed. Which is probably the faster path anyway for
that case.
Along the way, a couple of real fixes came from this as the
syscall_get_arguments() function was incorrect for csky and riscv.
x86 has been optimized for the new interface that removes the
variable number of arguments, but the other architectures could still
use some loving and take more advantage of the simpler interface"
* tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
syscalls: Remove start and number from syscall_set_arguments() args
syscalls: Remove start and number from syscall_get_arguments() args
csky: Fix syscall_get_arguments() and syscall_set_arguments()
riscv: Fix syscall_get_arguments() and syscall_set_arguments()
tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()
ptrace: Remove maxargs from task_current_syscall()
Memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().
Fixes: fa0ca2aee3 ("deal with get_reqs_available() in aio_get_req() itself")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The compression property is reset to NULL, instead of keeping the old
value, if we fail to set the new compression parameter.
$ btrfs prop get /btrfs compression
compression=lzo
$ btrfs prop set /btrfs compression zli
ERROR: failed to set compression for /btrfs: Invalid argument
$ btrfs prop get /btrfs compression
This is because the compression property's ->validate() succeeds for
'zli', as the strncmp() call used the length passed from userspace.
Fix it by using the expected string length in strncmp().
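The difference between the two comparisons can be shown with a small stand-alone sketch. The function names are hypothetical and only the lzo and zlib cases are modelled; this is not the btrfs ->validate() code itself.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Buggy variant: compares only `len` bytes, where `len` is the length of
 * the user-supplied value, so a truncated name such as "zli" matches. */
int validate_compression_buggy(const char *value, size_t len)
{
	if (!strncmp("lzo", value, len))
		return 0;
	if (!strncmp("zlib", value, len))
		return 0;
	return -EINVAL;
}

/* Fixed variant: compares the full length of each expected name. */
int validate_compression_fixed(const char *value)
{
	if (!strncmp("lzo", value, 3))
		return 0;
	if (!strncmp("zlib", value, 4))
		return 0;
	return -EINVAL;
}
```

With the value 'zli' and a length of 3, the buggy variant returns 0 because the first three bytes of 'zlib' match, while the fixed variant compares the fourth byte as well and rejects it.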
Fixes: 63541927c8 ("Btrfs: add support for inode properties")
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We accept a zstd compression parameter even if it is not fully valid.
For example:
$ btrfs prop set /btrfs compression zst
$ btrfs prop get /btrfs compression
compression=zst
zlib and lzo are fine.
Fix it by checking the correct prefix length.
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
task_current_syscall() has a single user that passes in 6 for maxargs, which
is the maximum number of arguments that syscall_get_arguments() can return.
Instead of passing in a number of arguments to grab, just get 6 arguments.
The args argument even specifies that it's an array of 6 items.
This will also allow changing syscall_get_arguments() to not get a variable
number of arguments, but always grab 6.
Linus also suggested not passing in a bunch of arguments to
task_current_syscall() but to instead pass in a pointer to a structure, and
just fill the structure. struct seccomp_data has almost all the parameters
that are needed except for the stack pointer (sp). As seccomp_data is part of
uapi, and I'm afraid to change it, a new structure, "syscall_info", was
created, which includes seccomp_data and adds the "sp" field.
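The shape of such a structure can be sketched in user space as below. The "_model" types are local mirrors, with the seccomp_data field layout assumed to follow include/uapi/linux/seccomp.h, and fill_syscall_info() is a hypothetical helper rather than the kernel function.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct seccomp_data (field layout is an assumption). */
struct seccomp_data_model {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Wraps the seccomp data and adds the stack pointer. */
struct syscall_info_model {
	uint64_t sp;
	struct seccomp_data_model data;
};

/* Hypothetical filler: always copies all six arguments, so the caller
 * passes one pointer instead of a start index and a count. */
void fill_syscall_info(struct syscall_info_model *info, int nr,
		       uint64_t sp, const uint64_t args[6])
{
	memset(info, 0, sizeof(*info));
	info->sp = sp;
	info->data.nr = nr;
	memcpy(info->data.args, args, sizeof(info->data.args));
}
```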
Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This accidentally returns the wrong variable. The "req->ki_eventfd"
pointer is NULL so this returns success.
Fixes: 7316b49c2a ("aio: move sanity checks and request allocation to io_submit_one()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Will Deacon reported the following KASAN complaint:
[ 149.890370] ==================================================================
[ 149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140
[ 149.892218]
[ 149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G B 5.1.0-rc3-00012-g40b114779944 #3
[ 149.893623] Hardware name: linux,dummy-virt (DT)
[ 149.894169] Call trace:
[ 149.894539] dump_backtrace+0x0/0x228
[ 149.895172] show_stack+0x14/0x20
[ 149.895747] dump_stack+0xe8/0x124
[ 149.896335] print_address_description+0x60/0x258
[ 149.897148] kasan_report_invalid_free+0x78/0xb8
[ 149.897936] __kasan_slab_free+0x1fc/0x228
[ 149.898641] kasan_slab_free+0x10/0x18
[ 149.899283] kfree+0x70/0x1f8
[ 149.899798] io_sqe_files_unregister+0xa8/0x140
[ 149.900574] io_ring_ctx_wait_and_kill+0x190/0x3c0
[ 149.901402] io_uring_release+0x2c/0x48
[ 149.902068] __fput+0x18c/0x510
[ 149.902612] ____fput+0xc/0x18
[ 149.903146] task_work_run+0xf0/0x148
[ 149.903778] do_notify_resume+0x554/0x748
[ 149.904467] work_pending+0x8/0x10
[ 149.905060]
[ 149.905331] Allocated by task 3974:
[ 149.905934] __kasan_kmalloc.isra.0.part.1+0x48/0xf8
[ 149.906786] __kasan_kmalloc.isra.0+0xb8/0xd8
[ 149.907531] kasan_kmalloc+0xc/0x18
[ 149.908134] __kmalloc+0x168/0x248
[ 149.908724] __arm64_sys_io_uring_register+0x2b8/0x15a8
[ 149.909622] el0_svc_common+0x100/0x258
[ 149.910281] el0_svc_handler+0x48/0xc0
[ 149.910928] el0_svc+0x8/0xc
[ 149.911425]
[ 149.911696] Freed by task 3974:
[ 149.912242] __kasan_slab_free+0x114/0x228
[ 149.912955] kasan_slab_free+0x10/0x18
[ 149.913602] kfree+0x70/0x1f8
[ 149.914118] __arm64_sys_io_uring_register+0xc2c/0x15a8
[ 149.915009] el0_svc_common+0x100/0x258
[ 149.915670] el0_svc_handler+0x48/0xc0
[ 149.916317] el0_svc+0x8/0xc
[ 149.916817]
[ 149.917101] The buggy address belongs to the object at ffff8004ce07ed00
[ 149.917101] which belongs to the cache kmalloc-128 of size 128
[ 149.919197] The buggy address is located 0 bytes inside of
[ 149.919197] 128-byte region [ffff8004ce07ed00, ffff8004ce07ed80)
[ 149.921142] The buggy address belongs to the page:
[ 149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0
[ 149.923595] flags: 0x1ffff00000010200(slab|head)
[ 149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00
[ 149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
[ 149.927011] page dumped because: kasan: bad access detected
[ 149.927956]
[ 149.928224] Memory state around the buggy address:
[ 149.929054] ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 149.930274] ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 149.932712] ^
[ 149.933281] ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 149.934508] ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 149.935725] ==================================================================
which is due to a failure in registering a fileset. This frees the
ctx->user_files pointer, but doesn't clear it. When the io_uring
instance is later freed through the normal channels, we free this
pointer again. At this point it's invalid.
Ensure we clear the pointer when we free it for the error case.
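The pattern can be sketched in user space as follows. The struct and function names are hypothetical stand-ins for the io_uring context and its register/unregister paths, and free() stands in for kfree().

```c
#include <stdlib.h>

/* Hypothetical stand-in for the io_uring context. */
struct ring_ctx_model {
	void **user_files;
	unsigned int nr_user_files;
};

/* Normal teardown path; harmless to run after a failed registration
 * because free(NULL) is a no-op. */
void files_unregister_model(struct ring_ctx_model *ctx)
{
	free(ctx->user_files);
	ctx->user_files = NULL;
	ctx->nr_user_files = 0;
}

/* Registration with an injectable failure after the allocation. */
int files_register_model(struct ring_ctx_model *ctx, unsigned int nr,
			 int simulate_failure)
{
	ctx->user_files = calloc(nr, sizeof(*ctx->user_files));
	if (!ctx->user_files)
		return -1;
	ctx->nr_user_files = nr;
	if (simulate_failure) {
		free(ctx->user_files);
		ctx->user_files = NULL; /* the fix: no dangling pointer */
		ctx->nr_user_files = 0;
		return -1;
	}
	return 0;
}
```

Without the assignment to NULL in the error path, the later teardown would free the same pointer a second time.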
Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It only means that we do not have a valid cached value for the
file_all_info structure.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reconnecting after server or network failure can be improved
(to maintain availability and protect data integrity) by allowing
the client to choose the default persistent (or resilient)
handle timeout in some use cases. Today we default to 0 which lets
the server pick the default timeout (usually 120 seconds) but this
can be problematic for some workloads. Add a new mount parameter,
"handletimeout", to cifs.ko for SMB3 mounts, which enables the user
to override the default handle timeout for persistent (mount
option "persistenthandles") or resilient handles (mount option
"resilienthandles"). Maximum allowed is 16 minutes (960000 ms).
Units for the timeout are expressed in milliseconds. See
sections 2.2.14.2.12 and 2.2.31.3 of the MS-SMB2 protocol
specification for more information.
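The range check implied by that limit can be sketched as below. The macro and function names are hypothetical and the code is a user-space illustration, not the cifs.ko option parser.

```c
#include <errno.h>
#include <stdint.h>

/* 16 minutes expressed in milliseconds: 16 * 60 * 1000. */
#define MAX_HANDLE_TIMEOUT_MS 960000u

/* Returns 0 if the requested timeout is acceptable, -EINVAL otherwise.
 * A value of 0 keeps the default, letting the server pick the timeout. */
int check_handle_timeout(uint32_t timeout_ms)
{
	if (timeout_ms > MAX_HANDLE_TIMEOUT_MS)
		return -EINVAL;
	return 0;
}
```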
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Some servers (see MS-SMB2 protocol specification
section 3.3.5.15.1) expect that the FSCTL enumerate snapshots
is done twice, with the first query having EXACTLY the minimum
size response buffer requested (16 bytes) which refreshes
the snapshot list (otherwise that and subsequent queries get
an empty list returned). So we had to add code to set
the maximum response size differently for the first snapshot
query (which gets the size needed for the second query, which
contains the actual list of snapshots).
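The two-pass sizing can be sketched as follows. This is a simplified model under stated assumptions: the names are hypothetical, and the second query is assumed to use exactly the size reported by the first response, which may differ from what the cifs.ko code does.

```c
#include <stddef.h>

/* Minimum response buffer size mentioned above, in bytes. */
#define MIN_SNAPSHOT_RESPONSE_SIZE 16

/* First query: request exactly the minimum size so the server refreshes
 * its snapshot list. Second query: request the size the first response
 * said was needed (assumption of this sketch). */
size_t snapshot_max_response_size(int is_first_query, size_t size_needed)
{
	if (is_first_query)
		return MIN_SNAPSHOT_RESPONSE_SIZE;
	return size_needed;
}
```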
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> # 4.19+
Fix a bug where we did not initialize the cached fid structure at all
in open_shroot() if the open was successful but we did not get a lease.
This would leave the structure uninitialized, and later, when closing the
handle, close_shroot() would try to kref_put() an uninitialized refcount.
Fix this by always initializing this structure if the open was successful
but only do the extra get() if we got a lease.
This extra get() is only used to hold the structure until we get a lease
break from the server at which point we will kref_put() it during lease
processing.
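The reference counting described above can be modelled with a plain integer. The struct and function names are hypothetical; the integer counter stands in for a kref, with the assignment playing the role of kref_init() and the increment the role of kref_get().

```c
/* Hypothetical model of the cached fid with a plain integer refcount. */
struct cached_fid_model {
	int is_valid;
	int refcount;
};

/* Successful open: always initialise the count; take the extra
 * reference only when a lease was granted. */
void open_shroot_model(struct cached_fid_model *cfid, int got_lease)
{
	cfid->is_valid = 1;
	cfid->refcount = 1;       /* models kref_init() */
	if (got_lease)
		cfid->refcount++; /* models kref_get(), held until lease break */
}

/* Drop one reference; returns 1 when the last reference is gone. */
int put_shroot_model(struct cached_fid_model *cfid)
{
	if (--cfid->refcount == 0) {
		cfid->is_valid = 0;
		return 1;
	}
	return 0;
}
```

Without a lease, the close drops the only reference; with a lease, the structure stays valid until the lease break drops the second one.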
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Pull aio race fixes and cleanups from Al Viro.
The aio code had more issues with error handling and races with the aio
completing at just the right (wrong) time along with freeing the file
descriptor when another thread closes the file.
Just a couple of these commits are the actual fixes: the others are
cleanups to either make the fixes simpler, or to make the code legible
and understandable enough that we hope there are no more fundamental races
hiding.
* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: move sanity checks and request allocation to io_submit_one()
deal with get_reqs_available() in aio_get_req() itself
aio: move dropping ->ki_eventfd into iocb_destroy()
make aio_read()/aio_write() return int
Fix aio_poll() races
aio: store event at final iocb_put()
aio: keep io_event in aio_kiocb
aio: fold lookup_kiocb() into its sole caller
pin iocb through aio.
Pull symlink fixes from Al Viro:
"The ceph fix is already in mainline, Daniel's bpf fix is in bpf tree
(1da6c4d914 "bpf: fix use after free in bpf_evict_inode"), the rest
is in here"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
debugfs: fix use-after-free on symlink traversal
ubifs: fix use-after-free on symlink traversal
jffs2: fix use-after-free on symlink traversal
symlink body shouldn't be freed without an RCU delay. Switch debugfs to
->destroy_inode() and use of call_rcu(); free both the inode and symlink
body in the callback. Similar to solution for bpf, only here it's even
more obvious that ->evict_inode() can be dropped.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge misc fixes from Andrew Morton:
"22 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
fs: fs_parser: fix printk format warning
checkpatch: add %pt as a valid vsprintf extension
mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
drivers/block/zram/zram_drv.c: fix idle/writeback string compare
mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
mm/memory_hotplug.c: fix notification in offline error path
ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK
fs/proc/kcore.c: make kcore_modules static
include/linux/list.h: fix list_is_first() kernel-doc
mm/debug.c: fix __dump_page when mapping->host is not set
mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
include/linux/hugetlb.h: convert to use vm_fault_t
iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
mm: add support for kmem caches in DMA32 zone
ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
mm/hotplug: fix offline undo_isolate_page_range()
fs/open.c: allow opening only regular files during execve()
mailmap: add Changbin Du
mm/debug.c: add a cast to u64 for atomic64_read()
...
Merge tag 'for-linus-20190329' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Small set of fixes that should go into this series. This contains:
- compat signal mask fix for io_uring (Arnd)
- EAGAIN corner case for direct vs buffered writes for io_uring
(Roman)
- NVMe pull request from Christoph with various little fixes
- sbitmap ws_active fix, which caused a perf regression for shared
tags (me)
- sbitmap bit ordering fix (Ming)
- libata on-stack DMA fix (Raymond)"
* tag 'for-linus-20190329' of git://git.kernel.dk/linux-block:
nvmet: fix error flow during ns enable
nvmet: fix building bvec from sg list
nvme-multipath: relax ANA state check
nvme-tcp: fix an endianess miss-annotation
libata: fix using DMA buffers on stack
io_uring: offload write to async worker in case of -EAGAIN
sbitmap: order READ/WRITE freed instance and setting clear bit
blk-mq: fix sbitmap ws_active for shared tags
io_uring: fix big-endian compat signal mask handling
blk-mq: update comment for blk_mq_hctx_has_pending()
blk-mq: use blk_mq_put_driver_tag() to put tag
a small use-after-free fix.
Merge tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to avoid choking on multipage bvecs in the messenger and a
small use-after-free fix"
* tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client:
ceph: fix use-after-free on symlink traversal
libceph: fix breakage caused by multipage bvecs
- Fix a bunch of static checker complaints about uninitialized variables
and insufficient range checks.
- Avoid a crash when incore extent map data are corrupt.
- Disallow FITRIM when we haven't recovered the log and know the
metadata are stale.
- Fix a data corruption when doing unaligned overlapping dio writes.
Merge tag 'xfs-5.1-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a few fixes for some corruption bugs and uninitialized
variable problems. The few patches here have gone through a few days
worth of fstest runs with no new problems observed.
Changes since last update:
- Fix a bunch of static checker complaints about uninitialized
variables and insufficient range checks.
- Avoid a crash when incore extent map data are corrupt.
- Disallow FITRIM when we haven't recovered the log and know the
metadata are stale.
- Fix a data corruption when doing unaligned overlapping dio writes"
* tag 'xfs-5.1-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: serialize unaligned dio writes against all other dio writes
xfs: prohibit fstrim in norecovery mode
xfs: always init bma in xfs_bmapi_write
xfs: fix btree scrub checking with regards to root-in-inode
xfs: dabtree scrub needs to range-check level
xfs: don't trip over uninitialized buffer on extent read of corrupted inode