Unlike the jiffy scheduling version, schedule_hrtimeout() jumps a few
functions before getting into schedule() even if there is no actual
timeout needed. Some tests showed that it takes up to 1% of CPU cycles.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89f880574eceee6f4899783377ead234df7b3d04.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_cqring_wait_schedule() is called after we started waiting on the cq
wq and set the state to TASK_INTERRUPTIBLE, for that reason we have to
constantly worry whether we has returned the state back to running or
not. Leave only quick checks in io_cqring_wait_schedule() and move the
rest including running task work to the callers. Note, we run tw in the
loop after the sched checks because of the fast path in the beginning of
the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2814fabe75e2e019e7ca43ea07daa94564349805.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most places that want to run local tw explicitly and in advance check if
they are allowed to do so. Don't rely on a similar check in
__io_run_local_work(), leave it as a just-in-case warning and make sure
callers checks capabilities themselves.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/990fe0e8e70fd4d57e43625e5ce8fba584821d1a.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Task work runners keep running until all queues tw items are exhausted.
It's also rare for defer tw to queue normal tw and vise versa. Taking it
into account, there is only a dim chance that further iterating the
io_cqring_wait() fast path will get us anything and so we can remove
the loop there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9565726661266abaa5d921e97433c831759ecf.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drain requests all go through io_drain_req, which has a quick exit in case
there is nothing pending (ie the drain is not useful). In that case it can
run the issue the request immediately.
However for safety it queues it through task work.
The problem is that in this case the request is run asynchronously, but
the async work has not been prepared through io_req_prep_async.
This has not been a problem up to now, as the task work always would run
before returning to userspace, and so the user would not have a chance to
race with it.
However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and
the work might be defered, giving userspace a chance to change data being
referred to in the request.
Instead _always_ prep_async for drain requests, which is simpler anyway
and removes this issue.
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127105911.2420061-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_R_DISABLED rings don't have the submitter task set, so
it's not always safe to use ->submitter_task. Disallow posting msg_ring
messaged to disabled rings. Also add task NULL check for loosy sync
around testing for IORING_SETUP_R_DISABLED.
Cc: stable@vger.kernel.org
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings".
Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first
requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings.
CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings.
This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for
example, renaming VM_MAYSHARED to VM_MAP_SHARED to make it cleaner what is
actually means.
Let's first get the weird case out of the way and not use VM_MAYSHARE in
MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead.
This patch (of 3):
We want to stop using VM_MAYSHARE in private mappings to pave the way for
clarifying the semantics of VM_MAYSHARE vs. VM_SHARED and reduce the
confusion. While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED,
!CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings
that are an effective overlay of a file mapping.
Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into
is_nommu_shared_mapping() first.
Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well
(unless there is a serious BUG). So there is not need to test for
VM_SHARED manually.
No functional change intended.
Link: https://lkml.kernel.org/r/20230102160856.500584-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230102160856.500584-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiffy to ktime CQ waiting conversion broke how we treat timeouts, in
particular we rearm it anew every time we get into
io_cqring_wait_schedule() without adjusting the timeout. Waiting for 2
CQEs and getting a task_work in the middle may double the timeout value,
or even worse in some cases task may wait indefinitely.
Cc: stable@vger.kernel.org
Fixes: 228339662b ("io_uring: don't convert to jiffies for waiting on timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7bffddd71b08f28a877d44d37ac953ddb01590d.1672915663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unlike normal tw, nothing prevents deferred tw to be executed right
after an tw item added to ->work_llist in io_req_local_work_add(). For
instance, the waiting task may get waken up by CQ posting or a normal
tw. Thus we need to pin the ring for the rest of io_req_local_work_add()
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1a79362b9c10b8523ef70b061d96523650a23344.1672795998.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only check the register opcode value inside the restricted ring
section, move it into the main io_uring_register() function instead
and check it up front.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have overflow entries being generated after we've done the
initial flush in io_cqring_wait(), then we could be flushing them in the
main wait loop as well. If that's done after having added ourselves
to the cq_wait waitqueue, then the task state can be != TASK_RUNNING
when we enter the overflow flush.
Check for the need to overflow flush, and finish our wait cycle first
if we have to do so.
Reported-and-tested-by: syzbot+cf6ea1d6bb30a4ce10b2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/000000000000cb143a05f04eee15@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's quite possible that we got woken up because task_work was queued,
and we need to process this task_work to generate the events waited for.
If we return to the wait loop without running task_work, we'll end up
adding the task to the waitqueue again, only to call
io_cqring_wait_schedule() again which will run the task_work. This is
less efficient than it could be, as it requires adding to the cq_wait
queue again. It also triggers the wakeup path for completions as
cq_wait is now non-empty with the task itself, and it'll require another
lock grab and deletion to remove ourselves from the waitqueue.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring uses call_rcu in the case it needs to signal an eventfd as a
result of an eventfd signal, since recursing eventfd signals are not
allowed. This should be calling the new call_rcu_hurry API to not delay
the signal.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20221215184138.795576-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because the single task locking series got reordered ahead of the
timeout and completion lock changes, two hunks inadvertently ended up
using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant
that we dropped overflow handling in those two spots. Reinstate the
correct CQE filling helper.
Fixes: f66f73421f ("io_uring: skip spinlocking for ->task_complete")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_kill_timeouts() doesn't post any events but queues everything to
task_work. Locking there is needed for protecting linked requests
traversing, we should grab completion_lock directly instead of using
io_cq_[un]lock helpers. Same goes for __io_req_find_next_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSZJAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpp+OD/9M04tGsVFCdqtKty5lBlDs03OnA03f404c
Ottp2pop2JrBx7+ycS/dl78MQjmghh8Ectel8kOiswDeRQc98TtzWY31DF1d56yS
aGYCq2Ww2gx5ziYmJgiFU7RRLTFlpfa/vZUBMK4HW4MYm2ihxtfNc72Oa8H9KGDJ
/RYk5+PSCau+UFwyWu91rORVNRXjLr1mFmgzRTmFhL2unYYuOO83mK4GpK2f8rHx
qpT7Wn9IS9xiTpr8rHqs8y6rxV6+Tnv/HqR8kKoviHvQU/u6fzvKSNEu1NvK/Znm
V0z8cI4JJZUelDyqJe5ITq8cIS59amzILEIneclYQd20NkqcFYPlS56K6qR9qL7J
6eNHvgH7iKvnk9JlR2soKojC6KWEPtVni2BjPEXgXHrfUWdINMKrT6MwTNOjztWj
h+EaqLBGQb5m/nRCCMeE9kfUK23Rg6Ev+H+aas0SgqD5Isg/hVG+aMtjWLmWquCU
pKg2UqxqsR9ymKj92KJSoN7F8Z2U0JsoHBKzAammlnmfhxl294RjGqBMJjjI5eS9
Zu+fTte4EuFY5s/TvE5FCBmQ0Oeg+ud2f+GKXDzF25equtct8QCfHIbguM6Yr3X2
3ANYGtLwP7Cj96U1Y++RWgTpBTTKwGkWyzEyS9SN/+MRXhIjeP8JZeGdXBDWaXLC
Vp7j39Avgg==
=MzhS
-----END PGP SIGNATURE-----
Merge tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux
Pull io_uring updates part two from Jens Axboe:
- Misc fixes (me, Lin)
- Series from Pavel extending the single task exclusive ring mode,
yielding nice improvements for the common case of having a single
ring per thread (Pavel)
- Cleanup for MSG_RING, removing our IOPOLL hack (Pavel)
- Further poll cleanups and fixes (Pavel)
- Misc cleanups and fixes (Pavel)
* tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux: (22 commits)
io_uring/msg_ring: flag target ring as having task_work, if needed
io_uring: skip spinlocking for ->task_complete
io_uring: do msg_ring in target task via tw
io_uring: extract a io_msg_install_complete helper
io_uring: get rid of double locking
io_uring: never run tw and fallback in parallel
io_uring: use tw for putting rsrc
io_uring: force multishot CQEs into task context
io_uring: complete all requests in task context
io_uring: don't check overflow flush failures
io_uring: skip overflow CQE posting for dying ring
io_uring: improve io_double_lock_ctx fail handling
io_uring: dont remove file from msg_ring reqs
io_uring: reshuffle issue_flags
io_uring: don't reinstall quiesce node for each tw
io_uring: improve rsrc quiesce refs checks
io_uring: don't raw spin unlock to match cq_lock
io_uring: combine poll tw handlers
io_uring: improve poll warning handling
io_uring: remove ctx variable in io_poll_check_events
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSYp4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgWNEADTcHejQ5y8Ctc1BEsCiEnCbR5DeIwESBKN
SfK7uK5N5GjKZgKeMtkdxNeckH2wSZ79VFpz2/b4fszM11H+P81r/PP+OwQojcra
1YyPbp2jXlQZGEYijiXYDohnKr8NnJYj/nuwm4T73OPOD48ekbrY36/t3Hd75jKk
M5L/HOpbKhA+fypPHYlm3XlwEM9/4wupsDeiabTuAeFpLh66V/h85ZLY91c3Bf7j
yllzT2CCGQsh+fnqdovpsqUxM4Sh6sHfa2QhwkfJFfU9OLwDz0TLlcvSVwWWICD1
DGeDE/MPBKZ4Z5zp8t94vTnZDAiExwkZbmCOnOkdYoPdMDqCDhp8a9WywV3IiJJi
+hRN6Au6BWw8Rj3b3pbobfdXFJAvoHqXHM1SDAsWmEIMpuMhNalIpYqtna9JwRDe
KV2ERMpGOUpmvQiRaJa5Wtz6hlFetmuav6Ij9DWb5f8+NXikiFAXk4g6TtD1Z7Cb
15klwM8qW50zX7YUlzBLcjNdj7+HuIswLx0obZ3Uv1ogDwMQKXEvPaAHUwVN02q6
cWP0P2Bc0EKvdwTSNpgNflhNsiV2DFJZpZfxkdPauN4t2EkJ38iEpHewffy0ecM7
uyaXGQVQW7WFDuGno4cGbThQIG95MqkRPBhEyB4cOmcyS/aZ92ZFtV1iI8dVDA+v
uuEIMc3OCA==
=EgDc
-----END PGP SIGNATURE-----
Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Always ensure proper ordering in case of CQ ring overflow, which then
means we can remove some work-arounds for that (Dylan)
- Support completion batching for multishot, greatly increasing the
efficiency for those (Dylan)
- Flag epoll/eventfd wakeups done from io_uring, so that we can easily
tell if we're recursing into io_uring again.
Previously, this would have resulted in repeated multishot
notifications if we had a dependency there. That could happen if an
eventfd was registered as the ring eventfd, and we multishot polled
for events on it. Or if an io_uring fd was added to epoll, and
io_uring had a multishot request for the epoll fd.
Test cases here:
https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd
Previously these got terminated when the CQ ring eventually
overflowed, now it's handled gracefully (me).
- Tightening of the IOPOLL based completions (Pavel)
- Optimizations of the networking zero-copy paths (Pavel)
- Various tweaks and fixes (Dylan, Pavel)
* tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits)
io_uring: keep unlock_post inlined in hot path
io_uring: don't use complete_post in kbuf
io_uring: spelling fix
io_uring: remove io_req_complete_post_tw
io_uring: allow multishot polled reqs to defer completion
io_uring: remove overflow param from io_post_aux_cqe
io_uring: add lockdep assertion in io_fill_cqe_aux
io_uring: make io_fill_cqe_aux static
io_uring: add io_aux_cqe which allows deferred completion
io_uring: allow defer completion for aux posted cqes
io_uring: defer all io_req_complete_failed
io_uring: always lock in io_apoll_task_func
io_uring: remove iopoll spinlock
io_uring: iopoll protect complete_post
io_uring: inline __io_req_complete_put()
io_uring: remove io_req_tw_post_queue
io_uring: use io_req_task_complete() in timeout
io_uring: hold locks for io_req_complete_failed
io_uring: add completion locking for iopoll
io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
...
->task_complete was added to serialised CQE posting by doing it from
the task context only (or fallback wq when the task is dead), and now we
can use that to avoid taking ->completion_lock while filling CQ entries.
The patch skips spinlocking only in two spots,
__io_submit_flush_completions() and flushing in io_aux_cqe, it's safer
and covers all cases we care about. Extra care is taken to force taking
the lock while queueing overflow entries.
It fundamentally relies on SINGLE_ISSUER to have only one task posting
events. It also need to take into account overflowed CQEs, flushing of
which happens in the cq wait path, and so this implementation also needs
DEFER_TASKRUN to limit waiters. For the same reason we disable it for
SQPOLL, and for IOPOLL as it won't benefit from it in any case.
DEFER_TASKRUN, SQPOLL and IOPOLL requirement may be relaxed in the
future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8c91fd82cfcdcc1d2e5bac7051fe2c183bda73.1670384893.git.asml.silence@gmail.com
[axboe: modify to apply]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds ctx->task_complete flag. If set, we'll complete all
requests in the context of the original task. Note, this extends to
completion CQE posting only but not io_kiocb cleanup / free, e.g. io-wq
may free the requests in the free calllback. This flag will be used
later for optimisations purposes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/21ece72953f76bb2e77659a72a14326227ab6460.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only way to fail overflowed CQEs flush is for CQ to be fully packed.
There is one place checking for flush failures, i.e. io_cqring_wait(),
but we limit the number to be waited for by the CQ size, so getting a
failure automatically means that we're done with waiting.
Don't check for failures, rarely but they might spuriously fail CQ
waiting with -EBUSY.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b720a45c03345655517f8202cbd0bece2848fb2.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After io_ring_ctx_wait_and_kill() is called there should be no users
poking into rings and so there is no need to post CQEs. So, instead of
trying to post overflowed CQEs into the CQ, drop them. Also, do it
in io_ring_exit_work() in a loop to reduce the number of contexts it
can be executed from and even when it struggles to quiesce the ring we
won't be leaving memory allocated for longer than needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26d13751155a735a3029e24f8d9ca992f810419d.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Syzkaller reports a NULL deref bug as follows:
BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
Read of size 4 at addr 0000000000000138 by task file1/1955
CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
? io_tctx_exit_cb+0x53/0xd3
kasan_report+0xbb/0x1f0
? io_tctx_exit_cb+0x53/0xd3
kasan_check_range+0x140/0x190
io_tctx_exit_cb+0x53/0xd3
task_work_run+0x164/0x250
? task_work_cancel+0x30/0x30
get_signal+0x1c3/0x2440
? lock_downgrade+0x6e0/0x6e0
? lock_downgrade+0x6e0/0x6e0
? exit_signals+0x8b0/0x8b0
? do_raw_read_unlock+0x3b/0x70
? do_raw_spin_unlock+0x50/0x230
arch_do_signal_or_restart+0x82/0x2470
? kmem_cache_free+0x260/0x4b0
? putname+0xfe/0x140
? get_sigframe_size+0x10/0x10
? do_execveat_common.isra.0+0x226/0x710
? lockdep_hardirqs_on+0x79/0x100
? putname+0xfe/0x140
? do_execveat_common.isra.0+0x238/0x710
exit_to_user_mode_prepare+0x15f/0x250
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x42/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0023:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Kernel panic - not syncing: panic_on_warn set ...
This happens because the adding of task_work from io_ring_exit_work()
isn't synchronized with canceling all work items from eg exec. The
execution of the two are ordered in that they are both run by the task
itself, but if io_tctx_exit_cb() is queued while we're canceling all
work items off exec AND gets executed when the task exits to userspace
rather than in the main loop in io_uring_cancel_generic(), then we can
find current->io_uring == NULL and hit the above crash.
It's safe to add this NULL check here, because the execution of the two
paths are done by the task itself.
Cc: stable@vger.kernel.org
Fixes: d56d938b4b ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com
[axboe: add code comment and also put an explanation in the commit msg]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is one newly added place when we lock ring with io_cq_lock() but
unlocking is hand coded calling spin_unlock directly. It's ugly and
troublesome in the long run. Make it consistent with the other completion
locking.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ca4f0564492b90214a190cd5b2a6c76522de138.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This partially reverts
6c16fe3c16 ("io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()")
The redundancy of __io_cq_unlock_post() was always to keep it inlined
into __io_submit_flush_completions(). Inline it back and rename with
hope of clarifying the intention behind it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/372a16c485fca44c069be2e92fc5e7332a1d7fd7.1669310258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Until now there was no reason for multishot polled requests to defer
completions as there was no functional difference. However now this will
actually defer the completions, for a performance win.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-10-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only call sites which would not allow overflow are also call sites
which would use the io_aux_cqe as they care about ordering.
So remove this parameter from io_post_aux_cqe.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-9-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the just introduced deferred post cqe completion state when possible
in io_aux_cqe. If not possible fallback to io_post_aux_cqe.
This introduces a complication because of allow_overflow. For deferred
completions we cannot know without locking the completion_lock if it will
overflow (and even if we locked it, another post could sneak in and cause
this cqe to be in overflow).
However since overflow protection is mostly a best effort defence in depth
to prevent infinite loops of CQEs for poll, just checking the overflow bit
is going to be good enough and will result in at most 16 (array size of
deferred cqes) overflows.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Multishot ops cannot use the compl_reqs list as the request must stay in
the poll list, but that means they need to run each completion without
benefiting from batching.
Here introduce batching infrastructure for only small (ie 16 byte)
CQEs. This restriction is ok because there are no use cases posting 32
byte CQEs.
In the ring keep a batch of up to 16 posted results, and flush in the same
way as compl_reqs.
16 was chosen through experimentation on a microbenchmark ([1]), as well
as trying not to increase the size of the ring too much. This increases
the size to 1472 bytes from 1216.
[1]: 9ac66b36bc
Run with $ make -j && ./benchmark/reg.b -s 1 -t 2000 -r 10
Gives results:
baseline 8309 k/s
8 18807 k/s
16 19338 k/s
32 20134 k/s
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-5-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All failures happen under lock now, and can be deferred. To be consistent
when the failure has happened after some multishot cqe has been
deferred (and keep ordering), always defer failures.
To make this obvious at the caller (and to help prevent a future bug)
rename io_req_complete_failed to io_req_defer_failed.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_cq_unlock_post() is identical to io_cq_unlock_post(), and
io_cqring_ev_posted() has a single caller so migth as well just inline
it there.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related
wakups, so that we can check for a circular event dependency between
eventfd and epoll. If this flag is set when our wakeup handlers are
called, then we know we have a dependency that needs to terminate
multishot requests.
eventfd and epoll are the only such possible dependencies.
Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the target process is dying and so task_work_add() is not allowed
we push all task_work item to the fallback workqueue. Move the part
responsible for moving tw items out of __io_req_task_work_add() into
a separate function. Makes it a bit cleaner and gives the compiler a bit
of extra info.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e503dab9d7af95470ca6b214c6de17715ae4e748.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_req_task_work_add() is huge but marked inline, that makes compilers
to generate lots of garbage. Inline the wrapper caller
io_req_task_work_add() instead.
before and after:
text data bss dec hex filename
47347 16248 8 63603 f873 io_uring/io_uring.o
text data bss dec hex filename
45303 16248 8 61559 f077 io_uring/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26dc8c28ca0160e3269ef3e55c5a8b917c4d4450.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Running task work when not needed can unnecessarily delay
operations. Specifically IORING_SETUP_DEFER_TASKRUN tries to avoid running
task work until the user requests it. Therefore do not run it in
io_uring_register any more.
The one catch is that io_rsrc_ref_quiesce expects it to have run in order
to process all outstanding references, and so reorder it's loop to do this.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107123349.4106213-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having REQ_F_POLLED set doesn't guarantee that the request is
executed as a multishot from the polling path. Fortunately for us, if
the code thinks it's multishot issue when it's not, it can only ask to
skip completion so leaking the request. Use issue_flags to mark
multipoll issues.
Cc: stable@vger.kernel.org
Fixes: 390ed29b5e ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7700ac57653f2823e30b34dc74da68678c0c5f13.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_cqring_wait (and it's wake function io_has_work) used cached_cq_tail in
order to calculate the number of CQEs. cached_cq_tail is set strictly
before the user visible rings->cq.tail
However as far as userspace is concerned, if io_uring_enter(2) is called
with a minimum number of events, they will verify by checking
rings->cq.tail.
It is therefore possible for io_uring_enter(2) to return early with fewer
events visible to the user.
Instead make the wait functions read from the user visible value, so there
will be no discrepency.
This is triggered eventually by the following reproducer:
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
unsigned int cqe_ready;
struct io_uring ring;
int ret, i;
ret = io_uring_queue_init(N, &ring, 0);
assert(!ret);
while(true) {
for (i = 0; i < N; i++) {
sqe = io_uring_get_sqe(&ring);
io_uring_prep_nop(sqe);
sqe->flags |= IOSQE_ASYNC;
}
ret = io_uring_submit(&ring);
assert(ret == N);
do {
ret = io_uring_wait_cqes(&ring, &cqe, N, NULL, NULL);
} while(ret == -EINTR);
cqe_ready = io_uring_cq_ready(&ring);
assert(!ret);
assert(cqe_ready == N);
io_uring_cq_advance(&ring, N);
}
Fixes: ad3eb2c89f ("io_uring: split overflow state into SQ and CQ side")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221108153016.1854297-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is possible for tw to lock the ring, and this was not propogated out to
io_run_local_work. This can cause an unlock to be missed.
Instead pass a pointer to locked into __io_run_local_work.
Fixes: 8ac5d85a89 ("io_uring: add local task_work run helper that is entered locked")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221027144429.3971400-3-dylany@meta.com
[axboe: WARN_ON() -> WARN_ON_ONCE() and add a minor comment]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should not be completing requests from a task context that has already
undergone io_uring cancellations, i.e. __io_uring_cancel(), as there are
some assumptions, e.g. around cached task refs draining. Remove
iopolling from io_ring_ctx_wait_and_kill() as it can be called later
after PF_EXITING is set with the last task_work run.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c03cc91455c4a1af49c6b9cbda4e57ea467aa11.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We test file_table.bitmap in io_file_get_fixed() to check invariants,
don't do it, it's expensive and was showing up in profiles. No reports of
this triggering has come in. Move the check to the file clear instead,
which will still catch any wrong usage.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf77f2ded68d2e5b2bc7355784d969837d48e023.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
THe lifetime of SCM'ed files is bound to ring_sock, which is destroyed
strictly after we're done with registered file tables. This means there
is no need for the FFS_SCM hack, which was not available on 32-bit builds
anyway.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/984226a1045adf42dc35d8bd7fb5a8bbfa472ce1.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Running local task_work requires taking uring_lock, for submit + wait we
can try to run them right after submit while we still hold the lock and
save one lock/unlokc pair. The optimisation was implemented in the first
local tw patches but got dropped for simplicity.
Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/281fc79d98b5d91fe4778c5137a17a2ab4693e5c.1665088876.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_cqring_wake() needs a barrier for the waitqueue_active() check.
However, in the case of io_req_local_work_add(), we call llist_add()
first, which implies an atomic. Hence we can replace smb_mb() with
smp_mb__after_atomic().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43983bc8bc507172adda7a0f00cab1aff09fd238.1665018309.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_SINGLE_ISSUER restricts what tasks can submit requests.
Extend it to registration as well, so non-owning task can't do
registrations. It's not necessary at the moment but might be useful in
the future.
Cc: <stable@vger.kernel.org> # 6.0
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52a6a9c8a8990d4a831f73c0571e7406aac2bba.1664237592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove submitter parameter from __io_uring_add_tctx_node.
It was only called from one place, and we can do that logic in that one
place.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
/a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
vQNj/iOBQA==
=R05e
-----END PGP SIGNATURE-----
Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Christoph:
- handle number of queue changes in the TCP and RDMA drivers
(Daniel Wagner)
- allow changing the number of queues in nvmet (Daniel Wagner)
- also consider host_iface when checking ip options (Daniel
Wagner)
- don't map pages which can't come from HIGHMEM (Fabio M. De
Francesco)
- avoid unnecessary flush bios in nvmet (Guixin Liu)
- shrink and better pack the nvme_iod structure (Keith Busch)
- add comment for unaligned "fake" nqn (Linjun Bao)
- print actual source IP address through sysfs "address" attr
(Martin Belanger)
- various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
- handle effects after freeing the request (Keith Busch)
- copy firmware_rev on each init (Keith Busch)
- restrict management ioctls to admin (Keith Busch)
- ensure subsystem reset is single threaded (Keith Busch)
- report the actual number of tagset maps in nvme-pci (Keith
Busch)
- small fabrics authentication fixups (Christoph Hellwig)
- add common code for tagset allocation and freeing (Christoph
Hellwig)
- stop using the request_queue in nvmet (Christoph Hellwig)
- set min_align_mask before calculating max_hw_sectors (Rishabh
Bhatnagar)
- send a rediscover uevent when a persistent discovery controller
reconnects (Sagi Grimberg)
- misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)
- MD pull request via Song:
- Various raid5 fix and clean up, by Logan Gunthorpe and David
Sloan.
- Raid10 performance optimization, by Yu Kuai.
- sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)
- IO scheduler switching quisce fix (Keith)
- s390/dasd block driver updates (Stefan)
- support for recovery for the ublk driver (ZiyangZhang)
- rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)
- blk-mq and null_blk map fixes (Bart)
- various bcache fixes (Coly, Jilin, Jules)
- nbd signal hang fix (Shigeru)
- block writeback throttling fix (Yu)
- optimize the passthrough mapping handling (me)
- prepare block cgroups to being gendisk based (Christoph)
- get rid of an old PSI hack in the block layer, moving it to the
callers instead where it belongs (Christoph)
- blk-throttle fixes and cleanups (Yu)
- misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
Miaohe, Bart, Coly, Gaosheng
* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
sbitmap: fix lockup while swapping
block: add rationale for not using blk_mq_plug() when applicable
block: adapt blk_mq_plug() to not plug for writes that require a zone lock
s390/dasd: use blk_mq_alloc_disk
blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
nvmet: don't look at the request_queue in nvmet_bdev_set_limits
nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
blk-mq: use quiesced elevator switch when reinitializing queues
block: replace blk_queue_nowait with bdev_nowait
nvme: remove nvme_ctrl_init_connect_q
nvme-loop: use the tagset alloc/free helpers
nvme-loop: store the generic nvme_ctrl in set->driver_data
nvme-loop: initialize sqsize later
nvme-fc: use the tagset alloc/free helpers
nvme-fc: store the generic nvme_ctrl in set->driver_data
nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
nvme-rdma: use the tagset alloc/free helpers
nvme-rdma: store the generic nvme_ctrl in set->driver_data
nvme-tcp: use the tagset alloc/free helpers
nvme-tcp: store the generic nvme_ctrl in set->driver_data
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9
Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd
+3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A
A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD
hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w
PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+
5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM
fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU
OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN
oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ
W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4
9bWG6BM7Yw==
=m9kX
-----END PGP SIGNATURE-----
Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Add supported for more directly managed task_work running.
This is beneficial for real world applications that end up issuing
lots of system calls as part of handling work. Normal task_work will
always execute as we transition in and out of the kernel, even for
"unrelated" system calls. It's more efficient to defer the handling
of io_uring's deferred work until the application wants it to be run,
generally in batches.
As part of ongoing work to write an io_uring network backend for
Thrift, this has been shown to greatly improve performance. (Dylan)
- Add IOPOLL support for passthrough (Kanchan)
- Improvements and fixes to the send zero-copy support (Pavel)
- Partial IO handling fixes (Pavel)
- CQE ordering fixes around CQ ring overflow (Pavel)
- Support sendto() for non-zc as well (Pavel)
- Support sendmsg for zerocopy (Pavel)
- Networking iov_iter fix (Stefan)
- Misc fixes and cleanups (Pavel, me)
* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
io_uring/net: fix notif cqe reordering
io_uring/net: don't update msg_name if not provided
io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
io_uring/rw: defer fsnotify calls to task context
io_uring/net: fix fast_iov assignment in io_setup_async_msg()
io_uring/net: fix non-zc send with address
io_uring/net: don't skip notifs for failed requests
io_uring/rw: don't lose short results on io_setup_async_rw()
io_uring/rw: fix unexpected link breakage
io_uring/net: fix cleanup double free free_iov init
io_uring: fix CQE reordering
io_uring/net: fix UAF in io_sendrecv_fail()
selftest/net: adjust io_uring sendzc notif handling
io_uring: ensure local task_work marks task as running
io_uring/net: zerocopy sendmsg
io_uring/net: combine fail handlers
io_uring/net: rename io_sendzc()
io_uring/net: support non-zerocopy sendto
io_uring/net: refactor io_setup_async_addr
io_uring/net: don't lose partial send_zc on fail
...
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of picking the task from the first submitter task, rather use the
creator task or in the case of disabled (IORING_SETUP_R_DISABLED) the
enabling task.
This approach allows a lot of simplification of the logic here. This
removes init logic from the submission path, which can always be a bit
confusing, but also removes the need for locking to write (or read) the
submitter_task.
Users that want to move a ring before submitting can create the ring
disabled and then enable it on the submitting task.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.
One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.
Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.
Cc: stable@vger.kernel.org
Fixes: e98e49b2bb ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We may propagate a positive return value of io_run_task_work() out of
io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't
return anything useful for us, ignore the return value.
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set,
but if nothing has been submitted yet it'll allow any waiter, which
violates the contract.
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of DEFER_TASK_WORK we try to restrict waiters to only one task,
which is also the only submitter; however, we don't do it reliably,
which might be very confusing and backfire in the future. E.g. we
currently allow multiple tasks in io_iopoll_check().
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Combine the two checks we have for task_work running and whether or not
we need to shuffle the mutex into one, so we unify how task_work is run
in the iopoll loop. This helps ensure that local task_work is run when
needed, and also optimizes that path to avoid a mutex shuffle if it's
not needed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have a few spots that drop the mutex just to run local task_work,
which immediately tries to grab it again. Add a helper that just passes
in whether we're locked already.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put this up in the same way as iopoll is done for regular read/write IO.
Make place for storing a cookie into struct io_uring_cmd on submission.
Perform the completion using the ->uring_cmd_iopoll handler.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some workloads rely on a registered eventfd (via
io_uring_register_eventfd(3)) in order to wake up and process the
io_uring.
In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd
also needs to be signalled when there are tasks to run.
This changes an old behaviour which assumed 1 eventfd signal implied at
least 1 CQE, however only when this new flag is set (and so old users will
not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN
flag as it is not guaranteed that every task will result in a CQE.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Non functional change: move this function above io_eventfd_signal so it
can be used from there
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.
Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.
For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.
The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.
This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.
There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.
[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100"
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is not needed, and it is normally better to wait for task work until
after submissions. This will allow greater batching if either work arrives
in the meanwhile, or if the submissions cause task work to be queued up.
For SQPOLL this also no longer runs task work, but this is handled inside
the SQPOLL loop anyway.
For IOPOLL io_iopoll_check will run task work anyway
And otherwise io_cqring_wait will run task work
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will be used later to know if the ring has outstanding work. Right
now just if there is overflow CQEs to copy to the main CQE ring, but later
will include deferred tasks
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we queue a request via tw for execution it's not going to be
executed immediately, so when io_queue_async() hits IO_APOLL_READY
and queues a tw but doesn't try to recycle/consume the buffer some other
request may try to use the the buffer.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a19bc9e211e3184215a58e129b62f440180e9212.1662480490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Following user feedback, this patch simplifies zerocopy send API. One of
the main complaints is that the current API is difficult with the
userspace managing notification slots, and then send retries with error
handling make it even worse.
Instead of keeping notification slots change it to the per-request
notifications model, which posts both completion and notification CQEs
for each request when any data has been sent, and only one CQE if it
fails. All notification CQEs will have IORING_CQE_F_NOTIF set and
IORING_CQE_F_MORE in completion CQEs indicates whether to wait a
notification or not.
IOSQE_CQE_SKIP_SUCCESS is disallowed with zerocopy sends for now.
This is less flexible, but greatly simplifies the user API and also the
kernel implementation. We reuse notif helpers in this patch, but in the
future there won't be need for keeping two requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/95287640ab98fc9417370afb16e310677c63e6ce.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are opcodes that need ->async_data only in some cases and
allocation it unconditionally may hurt performance. Add an option to
opdef to make move the allocation part from the core io_uring to opcode
specific code.
Note, we can't just set opdef->async_size to zero because there are
other helpers that rely on it, e.g. io_alloc_async_data().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dc62be9e88dd0ed63c48365340e8922d2498293.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want to use all optimisations that we have for io_uring requests like
completion batching, memory caching and more but for zc notifications.
Fortunately, notification perfectly fit the request model so we can
overlay them onto struct io_kiocb and use all the infratructure.
Most of the fields of struct io_notif natively fits into io_kiocb, so we
replace struct io_notif with struct io_kiocb carrying struct
io_notif_data in the cmd cache line. Then we adapt io_alloc_notif() to
use io_alloc_req()/io_alloc_req_refill(), and kill leftovers of hand
coded caching. __io_notif_complete_tw() is converted to use io_uring's
tw infra.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e010125175e80baf51f0ca63bdc7cc6a4a9fa56.1658913593.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kmalloc'ing struct io_notif is too expensive when done frequently, cache
them as many other resources in io_uring. Keep two list, the first one
is from where we're getting notifiers, it's protected by ->uring_lock.
The second is protected by ->completion_lock, to which we queue released
notifiers. Then we splice one list into another when needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dec18f7fcbab9f4bd40b96e5ae158b119945230.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add internal part of send zerocopy notifications. There are two main
structures, the first one is struct io_notif, which carries inside
struct ubuf_info and maps 1:1 to it. io_uring will be binding a number
of zerocopy send requests to it and ask to complete (aka flush) it. When
flushed and all attached requests and skbs complete, it'll generate one
and only one CQE. There are intended to be passed into the network layer
as struct msghdr::msg_ubuf.
The second concept is notification slots. The userspace will be able to
register an array of slots and subsequently addressing them by the index
in the array. Slots are independent of each other. Each slot can have
only one notifier at a time (called active notifier) but many notifiers
during the lifetime. When active, a notifier not going to post any
completion but the userspace can attach requests to it by specifying
the corresponding slot while issueing send zc requests. Eventually, the
userspace will want to "flush" the notifier losing any way to attach
new requests to it, however it can use the next atomatically added
notifier of this slot or of any other slot.
When the network layer is done with all enqueued skbs attached to a
notifier and doesn't need the specified in them user data, the flushed
notifier will post a CQE.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're offloading requests directly to io-wq because IOSQE_ASYNC was
set in the sqe, we can miss hashing writes appropriately because we
haven't set REQ_F_ISREG yet. This can cause a performance regression
with buffered writes, as io-wq then no longer correctly serializes writes
to that file.
Ensure that we set the flags in io_prep_async_work(), which will cause
the io-wq work item to be hashed appropriately.
Fixes: 584b0180f0 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit 8bb649ee1d ("io_uring: remove ring quiesce for
io_uring_register") removed the worklow relying on reinit/resurrection
of the percpu_ref, hence, initialization with that requested is a relic.
This is based on code review, this causes no real bug (and theoretically
can't). Technically it's a revert of commit 214828962d ("io_uring:
initialize percpu refcounters using PERCU_REF_ALLOW_REINIT") but since
the flag omission is now justified, I'm not making this a revert.
Fixes: 8bb649ee1d ("io_uring: remove ring quiesce for io_uring_register")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For recvmsg/sendmsg, if they don't complete inline, we currently need
to allocate a struct io_async_msghdr for each request. This is a
somewhat large struct.
Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the
alloc + free overhead considerably, yielding 4-5% of extra performance
running netbench.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caches like this tend to grow to the peak size, and then never get any
smaller. Impose a max limit on the size, to prevent it from growing too
big.
A somewhat randomly chosen 512 is the max size we'll allow the cache
to get. If a batch of frees come in and would bring it over that, we
simply start kfree'ing the surplus.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for adding limits, and one more user, abstract out the
core bits of the allocation+free cache.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In overflow we see a duplcate line in the trace, and in some cases 3
lines (if initial io_post_aux_cqe fails).
Instead just trace once for each CQE
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some use cases of io_post_aux_cqe would not want to overflow as is, but
might want to change the flags/result. For example multishot receive
requires in order CQE, and so if there is an overflow it would need to
stop receiving until the overflow is taken care of.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From recently io_uring provides an option to allocate a file index for
operation registering fixed files. However, it's utterly unusable with
mixed approaches when for a part of files the userspace knows better
where to place it, as it may race and users don't have any sane way to
pick a slot and hoping it will not be taken.
Let the userspace to register a range of fixed file slots in which the
auto-allocation happens. The use case is splittting the fixed table in
two parts, where on of them is used for auto-allocation and another for
slot-specified operations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_enter() takes ctx->refs, which was previously preventing racing
with register quiesce. However, as register now doesn't touch the refs,
we can freely kill extra ctx pinning and rely on the fact that we're
holding a file reference preventing the ring from being destroyed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io_uring cancelation API is async, like any other API that we expose
there. For the case of finding a request to cancel, or not finding one,
it is fully sync in that when submission returns, the CQE for both the
cancelation request and the targeted request have been posted to the
CQ ring.
However, if the targeted work is being executed by io-wq, the API can
only start the act of canceling it. This makes it difficult to use in
some circumstances, as the caller then has to wait for the CQEs to come
in and match on the same cancelation data there.
Provide a IORING_REGISTER_SYNC_CANCEL command for io_uring_register()
that does sync cancelations, always. For the io-wq case, it'll wait
for the cancelation to come in before returning. The only expected
returns from this API is:
0 Request found and canceled fine.
> 0 Requests found and canceled. Only happens if asked to
cancel multiple requests, and if the work wasn't in
progress.
-ENOENT Request not found.
-ETIME A timeout on the operation was requested, but the timeout
expired before we could cancel.
and we won't get -EALREADY via this API.
If the timeout value passed in is -1 (tv_sec and tv_nsec), then that
means that no timeout is requested. Otherwise, the timespec passed in
is the amount of time the sync cancel will wait for a successful
cancelation.
Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
trace task_work_run to help provide stats on how often task work is run
and what batch sizes are coming through.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Batching task work up is an important performance optimisation, as
task_work_add is expensive.
In order to keep the semantics replace the task_list with a fake node
while processing the old list, and then do a cmpxchg at the end to see if
there is more work.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With networking use cases we see contention on the spinlock used to
protect the task_list when multiple threads try and add completions at once.
Instead we can use a lockless list, and assume that the first caller to
add to the list is responsible for kicking off task work.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This optimisation has some built in assumptions that make it easy to
introduce bugs. It also does not have clear wins that make it worth keeping.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
spin_lock(&ctx->completion_lock);
/* post CQEs */
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);
We have many places repeating this sequence, and the three function
unlock section is not perfect from the maintainance perspective and also
makes it harder to add new locking/sync trick.
Introduce two helpers. io_cq_lock(), which is simple and only grabs
->completion_lock, and io_cq_unlock_post() encapsulating the three call
section.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe0c682bf7f7b55d9be55b0d034be9c1949277dc.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some io_uring-eventfd users assume that there won't be spurious wakeups.
That assumption has to be honoured by all io_cqring_ev_posted() callers,
which is inconvenient and from time to time leads to problems but should
be maintained to not break the userspace.
Instead of making the callers track whether a CQE was posted or not, hide
it inside io_eventfd_signal(). It saves ->cached_cq_tail it saw last time
and triggers the eventfd only when ->cached_cq_tail changed since then.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ffc66bae37a2513080b601e4370e147faaa72c5.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_try_cancel_requests() loops until there is nothing left to do
with the ring, however there might be several rings and they might have
dependencies between them, e.g. via poll requests.
Instead of cancelling rings one by one, try to cancel them all and only
then loop over if we still potenially some work to do.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8d491fe02d8ac4c77ff38061cf86b9a827e8845c.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's not clear how widely used IOSQE_CQE_SKIP_SUCCESS is, and how often
->flush_cqes flag prevents from completion being flushed. Sometimes it's
high level of concurrency that enables it at least for one CQE, but
sometimes it doesn't save much because nobody waiting on the CQ.
Remove ->flush_cqes flag and the optimisation, it should benefit the
normal use case. Note, that there is no spurious eventfd problem with
that as checks for spuriousness were incorporated into
io_eventfd_signal().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/692e81eeddccc096f449a7960365fa7b4a18f8e6.1655637157.git.asml.silence@gmail.com
[axboe: remove now dead state->flush_cqes variable]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have lots of trace events accepting an io_uring request and wanting
to print some of its fields like user_data, opcode, flags and so on.
However, as trace points were unaware of io_uring structures, we had to
pass all the fields as arguments. Teach trace/events/io_uring.h about
struct io_kiocb and stop the misery of passing a horde of arguments to
trace helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40ff72f92798114e56d400f2b003beb6cde6ef53.1655384063.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_fill_cqe_req() is hot and inlined, we want it to be as small as
possible. Add io_req_cqe_overflow() accepting only a request and doing
all overflow accounting, and replace with it two calls to 6 argument
io_cqring_event_overflow().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/048b9fbcce56814d77a1a540409c98c3d383edcb.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we do two extra spin lock/unlock pairs to add a poll/apoll
request to the cancellation hash table and remove it from there.
On the submission side we often already hold ->uring_lock and tw
completion is likely to hold it as well. Add a second cancellation hash
table protected by ->uring_lock. In concerns for latency because of a
need to have the mutex locked on the completion side, use the new table
only in following cases:
1) IORING_SETUP_SINGLE_ISSUER: only one task grabs uring_lock, so there
is little to no contention and so the main tw hander will almost
always end up grabbing it before calling callbacks.
2) IORING_SETUP_SQPOLL: same as with single issuer, only one task is
a major user of ->uring_lock.
3) apoll: we normally grab the lock on the completion side anyway to
execute the request, so it's free.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bbad9c78c454b7b92f100bbf46730a37df7194f.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new IORING_SETUP_SINGLE_ISSUER flag and the userspace visible part
of it, i.e. put limitations of submitters. Also, don't allow it together
with IOPOLL as we're not going to put it to good use.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4bcc41ee467fdf04c8aab8baf6ce3ba21858c3d4.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>