While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.
This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specific can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stack allocated struct sqe_submit is passed down to the submission path
along with a request (a.k.a. struct io_kiocb), and will be copied into
req->submit for async requests.
As space for it is already allocated, fill req->submit in the first
place instead of using on-stack one. As a result:
1. sqe->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. don't need to copy in case of async
3. allows to simplify the code by not carrying it as an argument all
the way down
4. allows to reduce number of function arguments / potentially improve
spilling
The downside is that stack is most probably be cached, that's not true
for just allocated memory for a request. Another concern is cache
pollution. Though, a request would be touched and fetched along with
req->submit at some point anyway, so shouldn't be a problem.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let io_submit_sqes() to allocate io_kiocb before fetching an sqe.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simplier and doesn't keep
an extra variable across the loop.
Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_submit_sqes() and io_ring_submit() are doing the same stuff with
a little difference. Deduplicate them.
Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We had no more use for this flag after the conversion to io-wq, kill it
off.
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if
REQ_F_LINK is set. Any failure in the chain should break the chain.
We were missing a few spots where this should be done. It might be nice
to generalize this somewhat at some point, as long as we factor in the
fact that failure looks different for each request type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As introduced by commit:
ba816ad61f ("io_uring: run dependent links inline if possible")
enable inline dependent link running for poll commands.
io_poll_complete_work() is the most important change, as it allows a
linked sequence of { POLL, READ } (for example) to proceed inline
instead of needing to get punted to another async context. The
submission side only potentially matters for sqthread, but may as well
include that bit.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently don't have a completion event trace, add one of those. And
to better be able to match up submissions and completions, add user_data
to the submission trace as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We didn't use -ERESTARTSYS to tell the application layer to restart the
system call, but instead return -EINTR. we can set -EINTR directly when
wakeup by the signal, which can help us save an assignment operation and
comparison operation.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
cancel requests that have been punted to async context and are now
in-flight. This works for regular read/write requests to files, as
long as they haven't been started yet. For socket based IO (or things
like accept4(2)), we can cancel work that is already running as well.
To cancel a request, the sqe must have ->addr set to the user_data of
the request it wishes to cancel. If the request is cancelled
successfully, the original request is completed with -ECANCELED
and the cancel request is completed with a result of 0. If the
request was already running, the original may or may not complete
in error. The cancel request will complete with -EALREADY for that
case. And finally, if the request to cancel wasn't found, the cancel
request is completed with -ENOENT.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.
Reset ->result whenever we re-prep a request for polled submission.
Cc: stable@vger.kernel.org
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reported an issue where we crash at setup time if failslab is
used. The issue is that io_wq_create() returns an error pointer on
failure, not NULL. Hence io_uring thought the io-wq was setup just
fine, but in reality it's a garbage error pointer.
Use IS_ERR() instead of a NULL check, and assign ret appropriately.
Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we get -1 from hrtimer_try_to_cancel(), we know that the timer
is running. Hence leave all completion to the timeout handler. If
we don't, we can corrupt the list and miss a completion.
Fixes: 11365043e5 ("io_uring: add support for canceling timeout requests")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's been a few requests for supporting more fixed files than 1024.
This isn't really tricky to do, we just need to split up the file table
into multiple tables and index appropriately. As we do so, reduce the
max single file table to 512. This enables us to do single page allocs
always for the tables, which is an improvement over the situation prior.
This patch adds support for up to 64K files, which should be enough for
everyone.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We index the file tables with a user given value. After we check
it's within our limits, use array_index_nospec() to prevent any
spectre attacks here.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for adding opcodes that need to add new files
in a process file table, system calls like open(2) or accept4(2).
If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
item. If work that needs to get punted to async context have this
set, the async worker will assume the original task file table before
executing the work.
Note that opcodes that need access to the current files of an
application cannot be done through IORING_SETUP_SQPOLL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop various work-arounds we have for workqueues:
- We no longer need the async_list for tracking sequential IO.
- We don't have to maintain our own mm tracking/setting.
- We don't need a separate workqueue for buffered writes. This didn't
even work that well to begin with, as it was suboptimal for multiple
buffered writers on multiple files.
- We can properly cancel pending interruptible work. This fixes
deadlocks with particularly socket IO, where we cannot cancel them
when the io_uring is closed. Hence the ring will wait forever for
these requests to complete, which may never happen. This is different
from disk IO where we know requests will complete in a finite amount
of time.
- Due to being able to cancel work interruptible work that is already
running, we can implement file table support for work. We need that
for supporting system calls that add to a process file table.
- It gets us one step closer to adding async support for any system
call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit fb5ccc9878 ("io_uring: Fix broken links with offloading")
introduced a potential performance regression with unconditionally
taking mm even for READ/WRITE_FIXED operations.
Return the logic handling it back. mm-faulted requests will go through
the generic submission path, so honoring links and drains, but will
fail further on req->has_user check.
Fixes: fb5ccc9878 ("io_uring: Fix broken links with offloading")
Cc: stable@vger.kernel.org # v5.4
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
submit->index is used only for inbound check in submission path (i.e.
head < ctx->sq_entries). However, it always will be true, as
1. it's already validated by io_get_sqring()
2. ctx->sq_entries can't be changedd in between, because of held
ctx->uring_lock and ctx->refs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To trace io_uring activity one can get an information from workqueue and
io trace events, but looks like some parts could be hard to identify via
this approach. Making what happens inside io_uring more transparent is
important to be able to reason about many aspects of it, hence introduce
the set of tracing events.
All such events could be roughly divided into two categories:
* those, that are helping to understand correctness (from both kernel
and an application point of view). E.g. a ring creation, file
registration, or waiting for available CQE. Proposed approach is to
get a pointer to an original structure of interest (ring context, or
request), and then find relevant events. io_uring_queue_async_work
also exposes a pointer to work_struct, to be able to track down
corresponding workqueue events.
* those, that provide performance related information. Mostly it's about
events that change the flow of requests, e.g. whether an async work
was queued, or delayed due to some dependencies. Another important
case is how io_uring optimizations (e.g. registered files) are
utilized.
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We might have cases where the need for a specific timeout is gone, add
support for canceling an existing timeout operation. This works like the
POLL_REMOVE command, where the application passes in the user_data of
the timeout it wishes to cancel in the sqe->addr field.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a pretty trivial addition on top of the relative timeouts
we have now, but it's handy for ensuring tighter timing for those
that are building scheduling primitives on top of io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no function change, just to clean up the code, use s->in_async
to make the code know where it is.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.
Certain application don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.
If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:
struct io_uring_files_update {
__u32 offset;
__s32 *fds;
};
that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:
1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
replaced with ->fds[i].
For case #2, is the existing file is currently empty (fd == -1), the
new fd is simply added to the array.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for allowing updates to fixed file sets without
requiring a full unregister+register.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently any dependent link is executed from a new workqueue context,
which means that we'll be doing a context switch per link in the chain.
If we are running the completion of the current request from our async
workqueue and find that the next request is a link, then run it directly
from the workqueue context instead of forcing another switch.
This improves the performance of linked SQEs, and reduces the CPU
overhead.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.
Defer ring fd installation until after we're done reading those
values.
Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.
For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe
Reorder them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.
The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().
Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a bug, where failed linked requests are returned not with
specified @user_data, but with garbage from a kernel stack.
The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sequence number of the timeout req (req->sequence) indicate the
expected completion request. Because of each timeout req consume a
sequence number, so the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same number (also
incorrect) if we insert a new entry before the last one, such as submit
such two timeout reqs on a new ring instance below.
req->sequence
req_1 (count = 2): 2
req_2 (count = 1): 2
Then, if we submit a nop req, req_2 will still timeout even the nop req
finished. This patch fix this problem by adjust the sequence number of
each reordered reqs when inserting a new entry.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sequence number of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
will consumes a slot in the cq_ring and cq_tail pointer will be
increased, otherwise other timeout reqs may return in advance without
waiting for enough wait_nr.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.
This reverts two commits:
bf7ec93c64c576666863
Fixes: c576666863 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2qbF0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptsuEADEKL8pta74uy50pl0t8l9fZ++U+wdIeEIW
9uumpOEPnI2GpkG1sOyKWK6tl8InQLw6pAquP9MoT2BHXqFHk7NIgtvk67lwQeoc
dRwklVfvOLAdnKzyfODqE9Fh9BgczZIuOLzgdtNqrPKqgJfFRCwN94Kj/r2tYuy7
v+riK3A49u12dOLtjU6ciNgZ0m1iUX9s0+PFYVUXtJHU/1OYToQaKP+sgWiue0Ca
VJP/L4MLYD0a7tfd92WAK7xWLsYWTDw1Gg20hXH/tV+IIDQ5+OXhu2s6PuqI7c0y
cZqWHQHBDkZMQvT8+V+YqZtEa+xwVCom51prJEPasmdq3fGx+2sDC1HQiySao1ML
wfFxZvFvY9fm6M7p2xsSNEcOmamrx1aLLyNSbjIvAqLUDYJWWS56BHsKyTU5Z+Jp
RA9dpq8iR6ISaIAcFf0IB0pJSv1HEeHyo/ixlALqezBFJaMdhWy/M+dEbWKtix9M
s19ozcpe+omN9+O0anlLtzKNgj2Xnjiwuu8mhVcqn6uG/p6GUOup+lNvTW/fig3I
JBH8kObjYXL181V9rYVqFutnuqcf2HYqMvV2vzAmg4LYnPVUmU7HMj8zEpxc4N+f
Evd77j0wXmY9S+4JERxaqQZuvKBEIkvM1rkk3N4NbNghfa7QL4aW+I9cWtuelPC2
E+DK7if0Gg==
=rvkw
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Keith that address deadlocks, double resets,
memory leaks, and other regression.
- Fixup elv_support_iosched() for bio based devices (Damien)
- Fixup for the ahci PCS quirk (Dan)
- Socket O_NONBLOCK handling fix for io_uring (me)
- Timeout sequence io_uring fixes (yangerkun)
- MD warning fix for parameter default_layout (Song)
- blkcg activation fixes (Tejun)
- blk-rq-qos node deletion fix (Tejun)
* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
nvme-pci: Set the prp2 correctly when using more than 4k page
io_uring: fix logic error in io_timeout
io_uring: fix up O_NONBLOCK handling for sockets
md/raid0: fix warning message for parameter default_layout
libata/ahci: Fix PCS quirk application
blk-rq-qos: fix first node deletion of rq_qos_del()
blkcg: Fix multiple bugs in blkcg_activate_policy()
io_uring: consider the overflow of sequence for timeout req
nvme-tcp: fix possible leakage during error flow
nvmet-loop: fix possible leakage during error flow
block: Fix elv_support_iosched()
nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
nvme: Wait for reset state when required
nvme: Prevent resets during paused controller state
nvme: Restart request timers in resetting state
nvme: Remove ADMIN_ONLY state
nvme-pci: Free tagset if no IO queues
nvme: retain split access workaround for capability reads
nvme: fix possible deadlock when nvme_update_formats fails
If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.
Fixes: 5da0fb1ab3 ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've got two issues with the non-regular file handling for non-blocking
IO:
1) We don't want to re-do a short read in full for a non-regular file,
as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
we need to punt to async context even if the file is opened as
non-blocking. Otherwise the caller always gets -EAGAIN.
Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.
Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:
1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.
2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.
This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2iBvAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvokEACaUozc0Y2L99CqJcRtoH8cwIAimjKTFSmi
UKijO/w11uc9xCIOospkM8iseRbpP/oCIwfYyVROi4LniPB1TSyQJoLF/MELeqIi
lOSg8RrXzC22h5+QG3BR5VO55me3MwIXmJLMMWXbh7FAuUg+wepVwHwzZBurd1Pz
C1wG8NU758AmqJdSGclUbdLLwTBkR+fXoi2EHfdbsBMqGmdZ5VRzCLx70grDcxq2
S9gt+2+diME6o6p1wko+m9pdhUdEy7epPE50K4SA+DU06I77OZpTdSAPilPyQHei
+3HDslzjCyos5/Ay+7hWl42jvyZ+x6GI5VcDLMTcMZsmxLy/WObDB4zcAixJrDV0
c7XWnwCCs2wMXYMsu0ASbkU32mcQ/Ik8CrX0W3/tP1aZBW2LZnOHvXYbybzpbgbo
QY47t3jz21uzF5kplBGxqaY1sPmLxp5V0vQmj9eN+Y7T+hiR/UgWKO/5B4CGNJe0
BQkXNl5cZAtqxmSE8AxlpOla7z6beZUljdhFc7K8cMDdhoatRgm2MBh2G9BB1ayo
VwHc1qLcvInpQrGiC4l5H/F8jrEs/PglknFSs1wlsav1gXA69J0cHA7LY6Czweaw
mkiKr7jA6duBBn6MRUThF5sBMFWYuxp3B/BA4wfDJsSfCmyhWN+OZlsDaEVrzk7o
/XK2Ew9R2Q==
=PUeL
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191012' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Single small fix for a regression in the sequence logic for linked
commands"
* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
io_uring: fix sequence logic for timeout requests
We have two ways a request can be deferred:
1) It's a regular request that depends on another one
2) It's a timeout that tracks completions
We have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.
Cc: stable@vger.kernel.org
Fixes: 6b06314c47 ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Any changes interesting to tasks waiting in io_cqring_wait() are
commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that but with no reason, that means spurious wakeups
every io_free_req() and io_uring_enter().
Just use percpu_ref_put() instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2WrkYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjf1D/9wy2L/yAXA/KLQkwDZ2kn3tMtzMrsv6bOJ
4UTdlLWKH2ihGg69iAE5f19iQSeNmqqXxBz0VvIBpFR7TqcRbGe0d9iVtoOsDhpZ
c1OvQ2Ey35Xx1T2w6uMh0llgeWY2J/gJkY64unUxwUBUZwNOPA8ZjxqeXcMQmyAt
sYmXpNLCT/f4YTYOYDgzoh5960TsCB/H+m/bLEVRvr0MaonvlaBUKRcysQikDFCQ
yobzDmlSJqGKqlFJ2fnRVSkJC0BmBE5p8Ric9HHiUOT8BO31079IHUGbkbSh/csH
0yPipNaYNMv+Hr0t9pgfcNbAt2weMK5HFgtpQwv8Frl4xjvBSWDS5fQesCVDjkZt
+ROeOvQtjfeKtLy5PCu6BJwYpu8iYG9eGF8zxBQ4FBHM3tghcVhqssaNbfrVOW+u
YXYbAuLMkLwKlmJ+6WBiVIMefyF59ue3+UJGECiCrj/BrgxUyw8HcGKwpKEAZSok
VFGDukL0Y3flnoO/gyOf0GFaD5Uovr1sx82DCz05B/XEMfkqFMJRGkbyZBarJL69
9QrnyGpF4rwtfg+usR1PmJ+9/oY/ypSk8N9MAIkoK9e1YIexxvBiXAf0k8AxuDyC
uPuOiQgKcqUr3aF+ivao8dQB9NiK1bJGc4pqBPPN4ZYRSjMSfBT/cms4IeUyj0K6
sokcB1p+CQ==
=vKVl
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Mandate timespec64 for the io_uring timeout ABI (Arnd)
- Set of NVMe changes via Sagi:
- controller removal race fix from Balbir
- quirk additions from Gabriel and Jian-Hong
- nvme-pci power state save fix from Mario
- Add 64bit user commands (for 64bit registers) from Marta
- nvme-rdma/nvme-tcp fixes from Max, Mark and Me
- Minor cleanups and nits from James, Dan and John
- Two s390 dasd fixes (Jan, Stefan)
- Have loop change block size in DIO mode (Martijn)
- paride pg header ifdef guard (Masahiro)
- Two blk-mq queue scheduler tweaks, fixing an ordering issue on zoned
devices and suboptimal performance on others (Ming)
* tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block: (22 commits)
block: sed-opal: fix sparse warning: convert __be64 data
block: sed-opal: fix sparse warning: obsolete array init.
block: pg: add header include guard
Revert "s390/dasd: Add discard support for ESE volumes"
s390/dasd: Fix error handling during online processing
io_uring: use __kernel_timespec in timeout ABI
loop: change queue block size to match when using DIO
blk-mq: apply normal plugging for HDD
blk-mq: honor IO scheduler for multiqueue devices
nvme-rdma: fix possible use-after-free in connect timeout
nvme: Move ctrl sqsize to generic space
nvme: Add ctrl attributes for queue_count and sqsize
nvme: allow 64-bit results in passthru commands
nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
nvmet-tcp: remove superflous check on request sgl
Added QUIRKs for ADATA XPG SX8200 Pro 512GB
nvme-rdma: Fix max_hw_sectors calculation
nvme: fix an error code in nvme_init_subsystem()
nvme-pci: Save PCI state before putting drive into deepest state
nvme-tcp: fix wrong stop condition in io_work
...
All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.
Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2OIu4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsedD/4h54330vuq66DsGqzFLonLwFl5YHC5NeJX
aV38j7pAUPHvr9CSnr3d2VwTk/ThBtPI50I9/d9SXh8n1oAAA5C/+nPf1XknME47
giKHr3eb0FNLOySt/Ry284gla8mO0GZM83zUbDMnF0N+tfwAtFvbvgCpsPFK9vdL
xNzMLsq6++pj7p9c6IXd+zv0nmJzDikZQz6PtU1KYlbfnU7hh/cP3CrIHIPtGbyk
c/7tfbKTB9UnbW5guGakZt3BzNViJpK28SKn+S6AlLiEOYpC+55dBbaZQIy5qxHv
CZsx0GJIw0Ya0Lw3UEFp/74krLHq2610jmx/va8P7MZCjZAR675G3mjxKUnC/+SY
mEdLo6vghMNAIqMBWNu59CFQOPnqa8sqRii0q6cRWXSqKiFr1FLN8mstb3Ghh9K0
kGVA/gw6ESWB/e/X6I+pD6pTm6O6BPWEqBzGAWSvavQQIP9YpIzf5j+k3JsRu03/
IIzR6gW9k9u4k0rFlOJKbp1+AO5sK3VtJFR8JGELiRwwgjD91w50gjPYak5OGM37
Mi7OHCxqtwFGTkSvT6RM6om6onBsizrveszkrPUO01bWYIHHbtu6ofLyQlfnEtpv
qbGZtLW6KYj9VsIKZNDfg99Ff79IAOiAZDbXWAu/JKyg/gu1Y9uOiVkNFPJGPNHV
8ourcldMGg==
=DEYH
-----END PGP SIGNATURE-----
Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"Just two things in here:
- Improvement to the io_uring CQ ring wakeup for batched IO (me)
- Fix wrong comparison in poll handling (yangerkun)
I realize the first one is a little late in the game, but it felt
pointless to hold it off until the next release. Went through various
testing and reviews with Pavel and peterz"
* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
io_uring: make CQ ring wakeups be more efficient
io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
For batched IO, it's not uncommon for waiters to ask for more than 1
IO to complete before being woken up. This is a problem with
wait_event() since tasks will get woken for every IO that completes,
re-check condition, then go back to sleep. For batch counts on the
order of what you do for high IOPS, that can result in 10s of extra
wakeups for the waiting task.
Add a private wake function that checks for the wake up count criteria
being met before calling autoremove_wake_function(). Pavel reports that
one test case he has runs 40% faster with proper batching of wakeups.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J9DEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqiqD/9D9etX/wXJ54bNKXpQqba3tUQy8e5SLnd4
q1ZzlJhvnmfNz6GqdH/1c+5Q2WBTLEb8PLJ4Z8d+1I8eWxpaCOzz362a7NA7x4l2
d1a8MV8gQy0moJKX7kTL9Pn44d2HofxyhgmHFKkvm/j7P4HIKbAuckzlcGoOS9TS
gV/bnroZ8eSrAfmswRRrxaWIoCe0iB+4x+nvoQjdVz8WXHcpBD8Y7hi5Ulid3Q/7
vU4WJubwKrP1dczk1L64CnYN4QfP2k4pyyxQwEBujueNuAxPKTRD3GXQc58c63Bp
GElE7FGxMUT4JhaY7U8G2pUSdAb2WiyMc9vd4Au7IkX0TEdQeRSHWkocnt2O9OPp
nk7XqPNK5MSskE4T9fjsJkE4KiOHhslixxc5qY6X9g5GuPw03ITM4VBwvU8aYxbf
p++jcWff10ovllck3uKvMpzU1IS20B4mSYYTyj990diocceEN6nsJsYv9wW4lMpk
lBL8b/Q3TaoMG58C5BiZvlf6JQqdKpN5UJhBmyrZAv5ozvgBAcSeLK41V6oxzGzo
y/3qcJeuIE6Exdm4DqWBHp7L82FcednO/EDqz3WnGx9v67kUz5VFqP7lIGNmXKM3
2PaBMWv7BU/Fx087KOsBOI5XWm3aVw+yTkHT2aitVQW0su+UDvb/BvFL0sqbBe3O
qm60F3TWkQ==
=/w5G
-----END PGP SIGNATURE-----
Merge tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"A collection of later fixes and additions, that weren't quite ready
for pushing out with the initial pull request.
This contains:
- Fix potential use-after-free of shadow requests (Jackie)
- Fix potential OOM crash in request allocation (Jackie)
- kmalloc+memcpy -> kmemdup cleanup (Jackie)
- Fix poll crash regression (me)
- Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)
- Add support for timeouts, making it easier to do epoll_wait()
conversions, for instance (me)
- Ensure io_uring works without f_ops->read_iter() and
f_ops->write_iter() (me)"
* tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
io_uring: correctly handle non ->{read,write}_iter() file_operations
io_uring: IORING_OP_TIMEOUT support
io_uring: use cond_resched() in sqthread
io_uring: fix potential crash issue due to io_get_req failure
io_uring: ensure poll commands clear ->sqe
io_uring: fix use-after-free of shadow_req
io_uring: use kmemdup instead of kmalloc and memcpy
Patch series "Make working with compound pages easier", v2.
These three patches add three helpers and convert the appropriate
places to use them.
This patch (of 3):
It's unnecessarily hard to find out the size of a potentially huge page.
Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After 75b28af("io_uring: allocate the two rings together"), we compare
sq.head with cached_cq_tail to determine does there any cq invalid.
Actually, we should use cq.head.
Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we just -EINVAL a read or write to an fd that isn't backed
by ->read_iter() or ->write_iter(). But we can handle them just fine,
as long as we punt fo async context first.
Implement a simple loop function for doing ->read() or ->write()
instead, and ensure we call it appropriately.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's been a few requests for functionality similar to io_getevents()
and epoll_wait(), where the user can specify a timeout for waiting on
events. I deliberately did not add support for this through the system
call initially to avoid overloading the args, but I can see that the use
cases for this are valid.
This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
when waiting for events, simply submit one of these timeout commands
with your wait call (or before). This ensures that the application
sleeping on the CQ ring waiting for events will get woken. The timeout
command is passed in as a pointer to a struct timespec. Timeouts are
relative. The timeout command also includes a way to auto-cancel after
N events has passed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If preempt isn't enabled in the kernel, we can run into hang issues with
sqthread submissions. Use cond_resched() to play nice instead of
cpu_relax(), if we end up starting the loop and not having any events
pending for submissions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sometimes io_get_req will return a NUL, then we need to do the
correct error handling, otherwise it will cause the kernel null
pointer exception.
Fixes: 4fe2c96315 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we end up getting woken in poll (due to a signal), then we may need
to punt the poll request to an async worker. When we do that, we look up
the list to queue at, deferefencing req->submit.sqe, however that is
only set for requests we initially decided to queue async.
This fixes a crash with poll command usage and wakeups that need to punt
to async context.
Fixes: 54a91f3bb9 ("io_uring: limit parallelism of buffered writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a potential dangling pointer problem. we never clean
shadow_req, if there are multiple link lists in this series of
sqes, then the shadow_req will not reallocate, and continue to
use the last one. but in the previous, his memory has been
released, thus forming a dangling pointer. let's clean up him
and make sure that every new link list can reapply for a new
shadow_req.
Fixes: 4fe2c96315 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some workloads can require far more than 4K oustanding entries. For
example memcached can have ~300K sockets over ~40 cores. Bumping the max
to 32K seems to work pretty well.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The way the logic is setup in io_uring_enter() means that you can't wake
up the SQ poller thread while at the same time waiting (or polling) for
completions afterwards. There's no reason for that to be the case.
Reported-by: Lewis Baker <lbaker@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently merge async work items if we see a strict sequential hit.
This helps avoid unnecessary workqueue switches when we don't need
them. We can extend this merging to cover cases where it's not a strict
sequential hit, but the IO still fits within the same page. If an
application is doing multiple requests within the same page, we don't
want separate workers waiting on the same page to complete IO. It's much
faster to let the first worker bring in the page, then operate on that
page from the same worker to complete the next request(s).
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All the popular filesystems need to grab the inode lock for buffered
writes. With io_uring punting buffered writes to async context, we
observe a lot of contention with all workers hamming this mutex.
For buffered writes, we generally don't need a lot of parallelism on
the submission side, as the flushing will take care of that for us.
Hence we don't need a deep queue on the write side, as long as we
can safely punt from the original submission context.
Add a workqueue with a limit of 2 that we can use for buffered writes.
This greatly improves the performance and efficiency of higher queue
depth buffered async writes with io_uring.
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for queueing a request for async execution, in preparation
for optimizing it.
No functional change in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For some applications that end up using a submit-and-wait type of
approach for certain batches of IO, we can make that a bit more
efficient by allowing the application to block for the last IO
submission. This prevents an async when we don't need it, as the
application will be blocking for the completion event(s) anyway.
Typical use cases are using the liburing
io_uring_submit_and_wait() API, or just using io_uring_enter()
doing both submissions and completions. As a specific example,
RocksDB doing MultiGet() is sped up quite a bit with this
change.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To support the link with drain, we need to do two parts.
There is an sqes:
0 1 2 3 4 5 6
+-----+-----+-----+-----+-----+-----+-----+
| N | L | L | L+D | N | N | N |
+-----+-----+-----+-----+-----+-----+-----+
First, we need to ensure that the io before the link is completed,
there is a easy way is set drain flag to the link list's head, so
all subsequent io will be inserted into the defer_list.
+-----+
(0) | N |
+-----+
| (2) (3) (4)
+-----+ +-----+ +-----+ +-----+
(1) | L+D | --> | L | --> | L+D | --> | N |
+-----+ +-----+ +-----+ +-----+
|
+-----+
(5) | N |
+-----+
|
+-----+
(6) | N |
+-----+
Second, ensure that the following IO will not be completed first,
an easy way is to create a mirror of drain io and insert it into
defer_list, in this way, as long as drain io is not processed, the
following io in the defer_list will not be actively process.
+-----+
(0) | N |
+-----+
| (2) (3) (4)
+-----+ +-----+ +-----+ +-----+
(1) | L+D | --> | L | --> | L+D | --> | N |
+-----+ +-----+ +-----+ +-----+
|
+-----+
('3) | D | <== This is a shadow of (3)
+-----+
|
+-----+
(5) | N |
+-----+
|
+-----+
(6) | N |
+-----+
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sqo_thread will get sqring in batches, which will cause
ctx->cached_sq_head to be added in batches. if one of these
sqes is set with the DRAIN flag, then he will never get a
chance to process, and finally sqo_thread will not exit.
Fixes: de0617e467 ("io_uring: add support for marking commands as draining")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After commit 75b28affdd we can get by with just a single mmap to
map both the sq and cq ring. However, userspace doesn't know that.
Add a features variable to io_uring_params, and notify userspace
that the kernel has this ability. This can then be used in liburing
(or in applications directly) to avoid the second mmap.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both the sq and the cq rings have sizes just over a power of two, and
the sq ring is significantly smaller. By bundling them in a single
alllocation, we get the sq ring for free.
This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean
the same thing. If we indicate this to userspace, we can save a mmap
call.
Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().
This is part a tree-wide conversion, as described in commit fc1d8e7cca
("mm: introduce put_user_page*(), placeholder versions").
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The outer poll loop checks for whether we need to reschedule, and
returns to userspace if we do. However, it's possible to get stuck
in the inner loop as well, if the CPU we are running on needs to
reschedule to finish the IO work.
Add the need_resched() check in the inner loop as well. This fixes
a potential hang if the kernel is configured with
CONFIG_PREEMPT_VOLUNTARY=y.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to check if we have CQEs pending before starting a poll loop,
as those could be the events we will be spinning for (and hence we'll
find none). This can happen if a CQE triggers an error, or if it is
found by eg an IRQ before we get a chance to find it through polling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a request issue ends up being punted to async context to avoid
blocking, we can get into a situation where the original application
enters the poll loop for that very request before it has been issued.
This should not be an issue, except that the polling will hold the
io_uring uring_ctx mutex for the duration of the poll. When the async
worker has actually issued the request, it needs to acquire this mutex
to add the request to the poll issued list. Since the application
polling is already holding this mutex, the workqueue sleeps on the
mutex forever, and the application thus never gets a chance to poll for
the very request it was interested in.
Fix this by ensuring that the polling drops the uring_ctx occasionally
if it's not making any progress.
Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch may fix two issues:
First, when IOSQE_IO_DRAIN set, the next IOs need to be inserted into
defer list to delay execution, but link io will be actively scheduled to
run by calling io_queue_sqe.
Second, when multiple LINK_IOs are inserted together with defer_list,
the LINK_IO is no longer keep order.
|-------------|
| LINK_IO | ----> insert to defer_list -----------
|-------------| |
| LINK_IO | ----> insert to defer_list ----------|
|-------------| |
| LINK_IO | ----> insert to defer_list ----------|
|-------------| |
| NORMAL_IO | ----> insert to defer_list ----------|
|-------------| |
|
queue_work at same time <-----|
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit bd11b3a391 ("io_uring: don't use iov_iter_advance() for fixed
buffers") introduced an optimization to avoid using the slow
iov_iter_advance by manually populating the iov_iter iterator in some
cases.
However, the computation of the iterator count field was erroneous: The
first bvec was always accounted for an extent of page size even if the
bvec length was smaller.
In consequence, some I/O operations on fixed buffers were unable to
operate on the full extent of the buffer, consistently skipping some
bytes at the end of it.
Fixes: bd11b3a391 ("io_uring: don't use iov_iter_advance() for fixed buffers")
Cc: stable@vger.kernel.org
Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z
0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH
d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB
0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc
vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B
p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j
uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4
1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn
/Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3
oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L
saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071
BphJd2RamQ==
=HIzH
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Several io_uring fixes/improvements:
- Blocking fix for O_DIRECT (me)
- Latter page slowness for registered buffers (me)
- Fix poll hang under certain conditions (me)
- Defer sequence check fix for wrapped rings (Zhengyuan)
- Mismatch in async inc/dec accounting (Zhengyuan)
- Memory ordering issue that could cause stall (Zhengyuan)
- Track sequential defer in bytes, not pages (Zhengyuan)
- NVMe pull request from Christoph
- Set of hang fixes for wbt (Josef)
- Redundant error message kill for libahci (Ding)
- Remove unused blk_mq_sched_started_request() and related ops (Marcos)
- drbd dynamic alloc shash descriptor to reduce stack use (Arnd)
- blkcg ->pd_stat() non-debug print (Tejun)
- bcache memory leak fix (Wei)
- Comment fix (Akinobu)
- BFQ perf regression fix (Paolo)
* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
io_uring: ensure ->list is initialized for poll commands
Revert "nvme-pci: don't create a read hctx mapping without read queues"
nvme: fix multipath crash when ANA is deactivated
nvme: fix memory leak caused by incorrect subsystem free
nvme: ignore subnqn for ADATA SX6000LNP
drbd: dynamically allocate shash descriptor
block: blk-mq: Remove blk_mq_sched_started_request and started_request
bcache: fix possible memory leak in bch_cached_dev_run()
io_uring: track io length in async_list based on bytes
io_uring: don't use iov_iter_advance() for fixed buffers
block: properly handle IOCB_NOWAIT for async O_DIRECT IO
blk-mq: allow REQ_NOWAIT to return an error inline
io_uring: add a memory barrier before atomic_read
rq-qos: use a mb for got_token
rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
rq-qos: don't reset has_sleepers on spurious wakeups
rq-qos: fix missed wake-ups in rq_qos_throttle
wait: add wq_has_single_sleeper helper
block, bfq: check also in-flight I/O in dispatch plugging
block: fix sysfs module parameters directory path in comment
...
Daniel reports that when testing an http server that uses io_uring
to poll for incoming connections, sometimes it hard crashes. This is
due to an uninitialized list member for the io_uring request. Normally
this doesn't trigger and none of the test cases caught it.
Reported-by: Daniel Kozak <kozzi11@gmail.com>
Tested-by: Daniel Kozak <kozzi11@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We are using PAGE_SIZE as the unit to determine if the total len in
async_list has exceeded max_pages, it's not fair for smaller io sizes.
For example, if we are doing 1k-size io streams, we will never exceed
max_pages since len >>= PAGE_SHIFT always gets zero. So use original
bytes to make it more accurate.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hrvoje reports that when a large fixed buffer is registered and IO is
being done to the latter pages of said buffer, the IO submission time
is much worse:
reading to the start of the buffer: 11238 ns
reading to the end of the buffer: 1039879 ns
In fact, it's worse by two orders of magnitude. The reason for that is
how io_uring figures out how to setup the iov_iter. We point the iter
at the first bvec, and then use iov_iter_advance() to fast-forward to
the offset within that buffer we need.
However, that is abysmally slow, as it entails iterating the bvecs
that we setup as part of buffer registration. There's really no need
to use this generic helper, as we know it's a BVEC type iterator, and
we also know that each bvec is PAGE_SIZE in size, apart from possibly
the first and last. Hence we can just use a shift on the offset to
find the right index, and then adjust the iov_iter appropriately.
After this fix, the timings are:
reading to the start of the buffer: 10135 ns
reading to the end of the buffer: 1377 ns
Or about an 755x improvement for the tail page.
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a hang issue while using fio to do some basic test. The issue
can be easily reproduced using the below script:
while true
do
fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \
-size=1G -iodepth=64 -name=uring --filename=/dev/zero
done
After several minutes (or more), fio would block at
io_uring_enter->io_cqring_wait in order to waiting for previously
committed sqes to be completed and can't return to user anymore until
we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at
io_ring_ctx_wait_and_kill with a backtrace like this:
[54133.243816] Call Trace:
[54133.243842] __schedule+0x3a0/0x790
[54133.243868] schedule+0x38/0xa0
[54133.243880] schedule_timeout+0x218/0x3b0
[54133.243891] ? sched_clock+0x9/0x10
[54133.243903] ? wait_for_completion+0xa3/0x130
[54133.243916] ? _raw_spin_unlock_irq+0x2c/0x40
[54133.243930] ? trace_hardirqs_on+0x3f/0xe0
[54133.243951] wait_for_completion+0xab/0x130
[54133.243962] ? wake_up_q+0x70/0x70
[54133.243984] io_ring_ctx_wait_and_kill+0xa0/0x1d0
[54133.243998] io_uring_release+0x20/0x30
[54133.244008] __fput+0xcf/0x270
[54133.244029] ____fput+0xe/0x10
[54133.244040] task_work_run+0x7f/0xa0
[54133.244056] do_exit+0x305/0xc40
[54133.244067] ? get_signal+0x13b/0xbd0
[54133.244088] do_group_exit+0x50/0xd0
[54133.244103] get_signal+0x18d/0xbd0
[54133.244112] ? _raw_spin_unlock_irqrestore+0x36/0x60
[54133.244142] do_signal+0x34/0x720
[54133.244171] ? exit_to_usermode_loop+0x7e/0x130
[54133.244190] exit_to_usermode_loop+0xc0/0x130
[54133.244209] do_syscall_64+0x16b/0x1d0
[54133.244221] entry_SYSCALL_64_after_hwframe+0x49/0xbe
The reason is that we had added a req to ctx->pending_async at the very
end, but it didn't get a chance to be processed. How could this happen?
fio#cpu0 wq#cpu1
io_add_to_prev_work io_sq_wq_submit_work
atomic_read() <<< 1
atomic_dec_return() << 1->0
list_empty(); <<< true;
list_add_tail()
atomic_read() << 0 or 1?
As atomic_ops.rst states, atomic_read does not guarantee that the
runtime modification by any other thread is visible yet, so we must take
care of that with a proper implicit or explicit memory barrier.
This issue was detected with the help of Jackie's <liuyun01@kylinos.cn>
Fixes: 31b5151064 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths. This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.
This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().
Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We could queue a work for each req in defer and link list without
increasing async_list->cnt, so we shouldn't decrease it while exiting
from workqueue as well if we didn't process the req in async list.
Thanks to Jens Axboe <axboe@kernel.dk> for his guidance.
Fixes: 31b5151064 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If
cached_sq_head overflows before cached_cq_tail, then we may miss a
barrier req. As cached_cq_tail always follows cached_sq_head, the NQ
should be enough.
Cc: stable@vger.kernel.org
Fixes: de0617e467 ("io_uring: add support for marking commands as draining")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull percpu updates from Dennis Zhou:
"This includes changes to let percpu_ref release the backing percpu
memory earlier after it has been switched to atomic in cases where the
percpu ref is not revived.
This will help recycle percpu memory earlier in cases where the
refcounts are pinned for prolonged periods of time"
* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0nUl4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplCaEACa7c7ybRgV6CZpS9PXXVqBJqYILRLBXsnn
MXomrSdjKMs0Y928RAQHFNWh3HaHHtdvSmFvwOpfF5lyrYXVfc9MQ7brqarDp1t2
f4jvkL63BVG2Zs/VL8QVAz+CwtCF39hduzUR/Y9j/4+rJNhSBMNJLY0nlB5weCTy
MJetnLoQ9ETA2+xu49vAM/PFJgBynNUAyUer918y8QysJRj90/VnhieQmrVb4tpG
Q4yZFKq4YPDs0tLEX4Nj6eJERcyW/4MC2oZ0aPXU4g2Dc3SVWaSNOo5WpkP+crGt
0dbyLmhomteE6+Kaco1hAWIkG/RuvgiMzDizryi0enXP51edV3Vnwyg3MQSUhcnf
Pn7vrDkajKBE9rFGlLy8V4gkKdS8XJQy2xA1MWm3aWgGl4v0j64EXIe0IhIK30vU
25A9jLDcdgr74+Lw+vWLLd+oeGD0iFf6wiEp+3jzEdtfVNE/lD6yilTzbdz2V0UK
8T1sRLMEkaG7CbxOVc1UAfcvObjuqQihEI0fQvl4yxV178h8mtWB87YmV2S2EhzP
v6FSxiC1yZ7J+rwb/Mff7+1GoOgzrpS/zESk2WMTgcwVdiwFfv5eIC26ZNWObJ/x
IY+4xRgTf2dEsjBeumOuBzxTfzrZb+pTO4GCa4O+t0UDQRIwl0y20pTXKtxU3y/U
gKPXEjgXrQ==
=jDiB
-----END PGP SIGNATURE-----
Merge tag 'for-5.3/io_uring-20190711' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"This contains:
- Support for recvmsg/sendmsg as first class opcodes.
I don't envision going much further down this path, as there are
plans in progress to support potentially any system call in an
async fashion through io_uring. But I think it does make sense to
have certain core ops available directly, especially those that can
support a "try this non-blocking" flag/mode. (me)
- Handle generic short reads automatically.
This can happen fairly easily if parts of the buffered read is
cached. Since the application needs to issue another request for
the remainder, just do this internally and save kernel/user
roundtrip while providing a nicer more robust API. (me)
- Support for linked SQEs.
This allows SQEs to depend on each other, enabling an application
to eg queue a read-from-this-file,write-to-that-file pair. (me)
- Fix race in stopping SQ thread (Jackie)"
* tag 'for-5.3/io_uring-20190711' of git://git.kernel.dk/linux-block:
io_uring: fix io_sq_thread_stop running in front of io_sq_thread
io_uring: add support for recvmsg()
io_uring: add support for sendmsg()
io_uring: add support for sqe links
io_uring: punt short reads to async context
uio: make import_iovec()/compat_import_iovec() return bytes on success
This is done through IORING_OP_RECVMSG. This opcode uses the same
sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
msghdr struct in the sqe->addr field as well.
We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
block, and punt to async execution if it would have.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
for the flags argument, and the msghdr struct is passed in the
sqe->addr field.
We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
block, and punt to async execution if it would have.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0jrIMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptlFD/9CNsBX+Aap2lO6wKNr6QISwNAK76GMzEay
s4LSY2kGkXvzv8i89mCuY+8UVNI8WH2/22WnU+8CBAJOjWyFQMsIwH/mrq0oZWRD
J6STJE8rTr6Fc2MvJUWryp/xdBh3+eDIsAdIZVHVAkIzqYPBnpIAwEIeIw8t0xsm
v9ngpQ3WD6ep8tOj9pnG1DGKFg1CmukZCC/Y4CQV1vZtmm2I935zUwNV/TB+Egfx
G8JSC0cSV02LMK88HCnA6MnC/XSUC0qgfXbnmP+TpKlgjVX+P/fuB3oIYcZEu2Rk
3YBpIkhsQytKYbF42KRLsmBH72u6oB9G+tNZTgB1STUDrZqdtD9xwX1rjDlY0ZzP
EUDnk48jl/cxbs+VZrHoE2TcNonLiymV7Kb92juHXdIYmKFQStprGcQUbMaTkMfB
6BYrYLifWx0leu1JJ1i7qhNmug94BYCSCxcRmH0p6kPazPcY9LXNmDWMfMuBPZT7
z79VLZnHF2wNXJyT1cBluwRYYJRT4osWZ3XUaBWFKDgf1qyvXJfrN/4zmgkEIyW7
ivXC+KLlGkhntDlWo2pLKbbyOIKY1HmU6aROaI11k5Zyh0ixKB7tHKavK39l+NOo
YB41+4l6VEpQEyxyRk8tO0sbHpKaKB+evVIK3tTwbY+Q0qTExErxjfWUtOgRWhjx
iXJssPRo4w==
=VSYT
-----END PGP SIGNATURE-----
Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main block updates for 5.3. Nothing earth shattering or
major in here, just fixes, additions, and improvements all over the
map. This contains:
- Series of documentation fixes (Bart)
- Optimization of the blk-mq ctx get/put (Bart)
- null_blk removal race condition fix (Bob)
- req/bio_op() cleanups (Chaitanya)
- Series cleaning up the segment accounting, and request/bio mapping
(Christoph)
- Series cleaning up the page getting/putting for bios (Christoph)
- block cgroup cleanups and moving it to where it is used (Christoph)
- block cgroup fixes (Tejun)
- Series of fixes and improvements to bcache, most notably a write
deadlock fix (Coly)
- blk-iolatency STS_AGAIN and accounting fixes (Dennis)
- Series of improvements and fixes to BFQ (Douglas, Paolo)
- debugfs_create() return value check removal for drbd (Greg)
- Use struct_size(), where appropriate (Gustavo)
- Two lighnvm fixes (Heiner, Geert)
- MD fixes, including a read balance and corruption fix (Guoqing,
Marcos, Xiao, Yufen)
- block opal shadow mbr additions (Jonas, Revanth)
- sbitmap compare-and-exhange improvemnts (Pavel)
- Fix for potential bio->bi_size overflow (Ming)
- NVMe pull requests:
- improved PCIe suspent support (Keith Busch)
- error injection support for the admin queue (Akinobu Mita)
- Fibre Channel discovery improvements (James Smart)
- tracing improvements including nvmetc tracing support (Minwoo Im)
- misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
Kulkarni)"
- Various little fixes and improvements to drivers and core"
* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
blk-iolatency: fix STS_AGAIN handling
block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
blk-mq: simplify blk_mq_make_request()
blk-mq: remove blk_mq_put_ctx()
sbitmap: Replace cmpxchg with xchg
block: fix .bi_size overflow
block: sed-opal: check size of shadow mbr
block: sed-opal: ioctl for writing to shadow mbr
block: sed-opal: add ioctl for done-mark of shadow mbr
block: never take page references for ITER_BVEC
direct-io: use bio_release_pages in dio_bio_complete
block_dev: use bio_release_pages in bio_unmap_user
block_dev: use bio_release_pages in blkdev_bio_end_io
iomap: use bio_release_pages in iomap_dio_bio_end_io
block: use bio_release_pages in bio_map_user_iov
block: use bio_release_pages in bio_unmap_user
block: optionally mark pages dirty in bio_release_pages
block: move the BIO_NO_PAGE_REF check into bio_release_pages
block: skd_main.c: Remove call to memset after dma_alloc_coherent
block: mtip32xx: Remove call to memset after dma_alloc_coherent
...
If we pass pages through an iov_iter we always already have a reference
in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
reference to pages by default for bvec backed iov_iters.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
linux/kernel.h: fix overflow for DIV_ROUND_UP_ULL
mm, swap: fix THP swap out
fork,memcg: alloc_thread_stack_node needs to set tsk->stack
MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
mm/page_idle.c: fix oops because end_pfn is larger than max_pfn
initramfs: fix populate_initrd_image() section mismatch
mm/oom_kill.c: fix uninitialized oc->constraint
mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
signal: remove the wrong signal_pending() check in restore_user_sigmask()
fs/binfmt_flat.c: make load_flat_shared_library() work
mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask
fs/proc/array.c: allow reporting eip/esp for all coredumping threads
mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual address
This is the minimal fix for stable, I'll send cleanups later.
Commit 854a6ed568 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporary unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.
Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.
Eric said:
: For clarity. I don't think this is required by posix, or fundamentally to
: remove the races in select. It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented. The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?
Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed568 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With SQE links, we can create chains of dependent SQEs. One example
would be queueing an SQE that's a read from one file descriptor, with
the linked SQE being a write to another with the same set of buffers.
An SQE link will not stall the pipeline, it'll just ensure that
dependent SQEs aren't issued before the previous link has completed.
Any error at submission or completion time will break the chain of SQEs.
For completions, this also includes short reads or writes, as the next
SQE could depend on the previous one being fully completed.
Any SQE in a chain that gets canceled due to any of the above errors,
will get an CQE fill with -ECANCELED as the error value.
Signed-off-by: Jens Axboe <axboe@kernel.dk>