OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Hao Xu	0c6e1d7fd5	io_uring: don't free request to slab It's not necessary to free the request back to slab when we fail to get sqe, just move it to state->free_list. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 13:04:26 -06:00
Pavel Begunkov	aaa4db12ef	io_uring: accept directly into fixed file table As done with open opcodes, allow accept to skip installing fd into processes' file tables and put it directly into io_uring's fixed file table. Same restrictions and design as for open. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 06:36:56 -06:00
Pavel Begunkov	a7083ad5e3	io_uring: hand code io_accept() fd installing Make io_accept() to handle file descriptor allocations and installation. A preparation patch for bypassing file tables. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 06:36:56 -06:00
Pavel Begunkov	b9445598d8	io_uring: openat directly into fixed fd table Instead of opening a file into a process's file table as usual and then registering the fd within io_uring, some users may want to skip the first step and place it directly into io_uring's fixed file table. This patch adds such a capability for IORING_OP_OPENAT and IORING_OP_OPENAT2. The behaviour is controlled by setting sqe->file_index, where 0 implies the old behaviour using normal file tables. If non-zero value is specified, then it will behave as described and place the file into a fixed file slot sqe->file_index - 1. A file table should be already created, the slot should be valid and empty, otherwise the operation will fail. Keep the error codes consistent with IORING_OP_FILES_UPDATE, ENXIO and EINVAL on inappropriate fixed tables, and return EBADF on collision with already registered file. Note: IOSQE_FIXED_FILE can't be used to switch between modes, because accept takes a file, and it already uses the flag with a different meaning. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-25 06:36:56 -06:00
Dmitry Kadashev	cf30da90bc	io_uring: add support for IORING_OP_LINKAT IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:48:52 -06:00
Dmitry Kadashev	7a8721f84f	io_uring: add support for IORING_OP_SYMLINKAT IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:48:33 -06:00
Jens Axboe	394918ebb8	io_uring: enable use of bio alloc cache Mark polled IO as being safe for dipping into the bio allocation cache, in case the targeted bio_set has it enabled. This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to ~3.5M IOPS. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:44:55 -06:00
Pavel Begunkov	dadebc350d	io_uring: fix io_try_cancel_userdata race for iowq WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975 CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975 Call Trace: io_async_cancel fs/io_uring.c:6014 [inline] io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407 io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511 io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533 io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 io_try_cancel_userdata() can be called from io_async_cancel() executing in the io-wq context, so the warning fires, which is there to alert anyone accessing task->io_uring->io_wq in a racy way. However, io_wq_put_and_exit() always first waits for all threads to complete, so the only detail left is to zero tctx->io_wq after the context is removed. note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor won't touch ->io_wq, because io_wq_destroy() might cancel left pending requests in such a way. Cc: stable@vger.kernel.org Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:56 -06:00
Dmitry Kadashev	e34a02dc40	io_uring: add support for IORING_OP_MKDIRAT IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-10-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:41:26 -06:00
Pavel Begunkov	126180b95f	io_uring: IRQ rw completion batching Employ inline completion logic for read/write completions done via io_req_task_complete(). If ->uring_lock is contended, just do normal request completion, but if not, make tctx_task_work() to grab the lock and do batched inline completions in io_req_task_complete(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:04 -06:00
Pavel Begunkov	f237c30a56	io_uring: batch task work locking Many task_work handlers either grab ->uring_lock, or may benefit from having it. Move locking logic out of individual handlers to a lazy approach controlled by tctx_task_work(), so we don't keep doing tons of mutex lock/unlock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:04 -06:00
Pavel Begunkov	5636c00d3e	io_uring: flush completions for fallbacks io_fallback_req_func() doesn't expect anyone creating inline completions, and no one currently does that. Teach the function to flush completions preparing for further changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:04 -06:00
Pavel Begunkov	26578cda3d	io_uring: add ->splice_fd_in checks ->splice_fd_in is used only by splice/tee, but no other request checks it for validity. Add the check for most of request types excluding reads/writes/sends/recvs, we don't want overhead for them and can leave them be as is until the field is actually used. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:13:00 -06:00
Jens Axboe	2c5d763c19	io_uring: add clarifying comment for io_cqring_ev_posted() We've previously had an issue where overflow flush unconditionally calls io_cqring_ev_posted() even if it didn't flush any events to the ring, causing wake and eventfd increment where no new events are available. Some applications don't like that, see commit `b18032bb0a` for details. This came up in discussion for another patch recently, hence add a comment detailing what the relationship between calling the events posted helper and CQ ring entries is. Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:47 -06:00
Pavel Begunkov	0bea96f59b	io_uring: place fixed tables under memcg limits Fixed tables may be large enough, place all of them together with allocated tags under memcg limits. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:47 -06:00
Pavel Begunkov	3a1b8a4e84	io_uring: limit fixed table size by RLIMIT_NOFILE Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE, that's the first and the simpliest restriction that we should impose. Cc: stable@vger.kernel.org Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
Hao Xu	99c8bc52d1	io_uring: fix lack of protection for compl_nr coml_nr in ctx_flush_and_put() is not protected by uring_lock, this may cause problems when accessing in parallel: say coml_nr > 0 ctx_flush_and put other context if (compl_nr) get mutex coml_nr > 0 do flush coml_nr = 0 release mutex get mutex do flush () release mutex in () place, we call io_cqring_ev_posted() and users likely get no events there. To avoid spurious events, re-check the value when under the lock. Fixes: `2c32395d81` ("io_uring: fix __tctx_task_work() ctx race") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
wangyangbo	187f08c12c	io_uring: Add register support for non-4k PAGE_SIZE Now allocated rsrc table uses PAGE_SIZE as the size of 2nd-level, and accessing this table relies on each level index from fixed TABLE_SHIFT (12 - 3) in 4k page case. In order to correctly work in non-4k page, define TABLE_SHIFT as non-fixed (PAGE_SHIFT - shift of data) for 2nd-level table entry number. Signed-off-by: wangyangbo <wangyangbo@uniontech.com> Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
Pavel Begunkov	e98e49b2bb	io_uring: extend task put optimisations Now with IRQ completions done via IRQ, almost all requests freeing are done from the context of submitter task, so it makes sense to extend task_put optimisation from io_req_free_batch_finish() to cover all the cases including task_work by moving it into io_put_task(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:46 -06:00
Jens Axboe	316319e82f	io_uring: add comments on why PF_EXITING checking is safe We have two checks of task->flags & PF_EXITING left: 1) In io_req_task_submit(), which is called in task_work and hence always in the context of the original task. That means that req->task == current, and hence checking ->flags is totally fine. 2) In io_poll_rewait(), where we need to stop re-arming poll to prevent it interfering with cancelation. This is only run from task_work as well, and hence for this case too req->task == current. Add a comment to both spots detailing that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	ec3c3d0f3a	io_uring: fix io_timeout_remove locking io_timeout_cancel() posts CQEs so needs ->completion_lock to be held, so grab it in io_timeout_remove(). Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	23a65db83b	io_uring: improve same wq polling Move earlier the check for whether __io_queue_proc() tries to poll already polled waitqueue, and do the same for the second poll entry, if any. Shouldn't really matter, but at least it would have a more predictable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	505657bc6c	io_uring: reuse io_req_complete_post() We have io_req_complete_post() to post a CQE and put the request. It takes care of all synchronisation and is more concise and efficent, so replace all hancoded occurrences of "lock; post CQE; unlock; + put_req()" with io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	ae421d9350	io_uring: better encapsulate buffer select for rw Make io_put_rw_kbuf() to do the REQ_F_BUFFER_SELECTED check, so all the callers don't need to hand code it. The number of places where we call io_put_rw_kbuf() is growing, so saves some pain. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	906c6caaf5	io_uring: optimise io_prep_linked_timeout() Linked timeout handling during issuing is heavy, it adds extra instructions and forces to save the next linked timeout before io_issue_sqe(). Follwing the same reasoning as in refcounting patches, a request can't be freed by the time it returns from io_issue_sqe(), so now we don't need to do io_prep_linked_timeout() in advance, and it can be delayed to colder paths optimising the generic path. Also, it should also save quite a lot for requests with linked timeouts and completed inline on timeout spinlocking + hrtimer_start() + hrtimer_try_to_cancel() and so on. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	0756a86910	io_uring: cancel not-armed linked touts separately Adjust io_disarm_next(), so it can detect if there is a linked but not-yet-armed timeout and complete/cancel it separately. Will be used in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	4d13d1a4d1	io_uring: simplify io_prep_linked_timeout The link test in io_prep_linked_timeout() is pretty bulky, replace it with a flag. It's better for normal path and linked requests, and also will be used further for request failing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	b97e736a4b	io_uring: kill REQ_F_LTIMEOUT_ACTIVE Instead of handling double consecutive linked timeouts through tricky flag combinations, just check the submit_state.link during timeout_prep and fail that case in advance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:43 -06:00
Pavel Begunkov	fd08e5309b	io_uring: optimise hot path of ltimeout prep io_prep_linked_timeout() grew too heavy and compiler now refuse to inline the function. Help it by splitting in two and annotating with inline. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	8cb01fac98	io_uring: deduplicate cancellation code IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT have enough of overlap, so extract a helper for request cancellation and use in both. Also, removes some amount of ugliness because of success_ret. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	a8576af9d1	io_uring: kill not necessary resubmit switch `773af69121` ("io_uring: always reissue from task_work context") makes all resubmission to be made from task_work, so we don't need that hack with resubmit/not-resubmit switch anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	fb6820998f	io_uring: optimise initial ltimeout refcounting Linked timeouts are never refcounted when it comes to the first call to __io_prep_linked_timeout(), so save an io_ref_get() and set the desired value directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	761bcac157	io_uring: don't inflight-track linked timeouts Tracking linked timeouts as infligh was needed to make sure that io-wq is not destroyed by io_uring_cancel_generic() racing with io_async_cancel_one() accessing it. Now, cancellations issued by linked timeouts are done in the task context, so it's already synchronised. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	48dcd38d73	io_uring: optimise iowq refcounting If a requests is forwarded into io-wq, there is a good chance it hasn't been refcounted yet and we can save one req_ref_get() by setting the refcount number to the right value directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Jens Axboe	a141dd896f	io_uring: correct __must_hold annotation io_req_free_batch() has a __must_hold annotation referencing a request being passed in, but we're passing in the context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Hao Xu	41a5169c23	io_uring: code clean for completion_lock in io_arm_poll_handler() We can merge two spin_unlock() operations to one since we removed some code not long ago. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Hao Xu	f552a27afe	io_uring: remove files pointer in cancellation functions When doing cancellation, we use a parameter to indicate where it's from do_exit or exec. So a boolean value is good enough for this, remove the struct files* as it is not necessary. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:37 -06:00
Pavel Begunkov	20e60a3832	io_uring: skip request refcounting As submission references are gone, there is only one initial reference left. Instead of actually doing atomic refcounting, add a flag indicating whether we're going to take more refs or doing any other sync magic. The flag should be set before the request may get used in parallel. Together with the previous patch it saves 2 refcount atomics per request for IOPOLL and IRQ completions, and 1 atomic per req for inline completions, with some exceptions. In particular, currently, there are three cases, when the refcounting have to be enabled: - Polling, including apoll. Because double poll entries takes a ref. Might get relaxed in the near future. - Link timeouts, enabled for both, the timeout and the request it's bound to, because they work in-parallel and we need to synchronise to cancel one of them on completion. - When a request gets in io-wq, because it doesn't hold uring_lock and we need guarantees of submission references. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	5d5901a343	io_uring: remove submission references Requests are by default given with two references, submission and completion. Completion references are straightforward, they represent request ownership and are put when a request is completed or so. Submission references are a bit more trickier. They're needed when io_issue_sqe() followed deep into the submission stack (e.g. in fs, block, drivers, etc.), request may have given away for concurrent execution or already completed, and the code unwinding back to io_issue_sqe() may be accessing some pieces of our requests, e.g. file or iov. Now, we prevent such async/in-depth completions by pushing requests through task_work. Punting to io-wq is also done through task_works, apart from a couple of cases with a pretty well known context. So, there're two cases: 1) io_issue_sqe() from the task context and protected by ->uring_lock. Either requests return back to io_uring or handed to task_work, which won't be executed because we're currently controlling that task. So, we can be sure that requests are staying alive all the time and we don't need submission references to pin them. 2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of submission reference is played by io-wq reference, which is put by io_wq_submit_work(). Hence, it should be fine. Considering that, we can carefully kill the submission reference. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	91c2f69783	io_uring: remove req_ref_sub_and_test() Soon, we won't need to put several references at once, remove req_ref_sub_and_test() and @nr argument from io_put_req_deferred(), and put the rest of the references by hand. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	21c843d582	io_uring: move req_ref_get() and friends Move all request refcount helpers to avoid forward declarations in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	79ebeaee8a	io_uring: remove IRQ aspect of io_ring_ctx completion lock We have no hard/soft IRQ users of this lock left, remove any IRQ disabling/saving and restoring when grabbing this lock. This is straight forward with no users entering with IRQs disabled anymore, the only thing to look out for is the waitqueue poll head lock which nests inside the completion lock. That needs IRQs disabled, and hence we have to do that now instead of relying on the outer lock doing so. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	8ef12efe26	io_uring: run regular file completions from task_work This is in preparation to making the completion lock work outside of hard/soft IRQ context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	89b263f6d5	io_uring: run linked timeouts from task_work This is in preparation to making the completion lock work outside of hard/soft IRQ context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Jens Axboe	89850fce16	io_uring: run timeouts from task_work This is in preparation to making the completion lock work outside of hard/soft IRQ context. Add a timeout_lock to handle the ordering of timeout completions or cancelations with the timeouts actually triggering. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	62906e89e6	io_uring: remove file batch-get optimisation For requests with non-fixed files, instead of grabbing just one reference, we get by the number of left requests, so the following requests using the same file can take it without atomics. However, it's not all win. If there is one request in the middle not using files or having a fixed file, we'll need to put back the left references. Even worse if an application submits requests dealing with different files, it will do a put for each new request, so doubling the number of atomics needed. Also, even if not used, it's still takes some cycles in the submission path. If a file used many times, it rather makes sense to pre-register it, if not, we may fall in the described pitfall. So, this optimisation is a matter of use case. Go with the simpliest code-wise way, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	6294f3686b	io_uring: clean up tctx_task_work() After recent fixes, tctx_task_work() always does proper spinlocking before looking into ->task_list, so now we don't need atomics for ->task_state, replace it with non-atomic task_running using the critical section. Tide it up, combine two separate block with spinlocking, and always try to splice in there, so we do less locking when new requests are arriving during the function execution. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix missing ->task_running reset on task_work_add() failure] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:32 -06:00
Pavel Begunkov	5d70904367	io_uring: inline io_poll_remove_waitqs Inline io_poll_remove_waitqs() into its only user and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:26 -06:00
Pavel Begunkov	90f67366cb	io_uring: remove extra argument for overflow flush Unlike __io_cqring_overflow_flush(), nobody does forced flushing with io_cqring_overflow_flush(), so removed the argument from it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:10:19 -06:00
Pavel Begunkov	cd0ca2e048	io_uring: inline struct io_comp_state Inline struct io_comp_state into struct io_submit_state. They are already coupled tightly, together with mixed responsibilities it only brings confusion having them separately. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:48 -06:00
Pavel Begunkov	bb943b8265	io_uring: use inflight_entry instead of compl.list req->compl.list is used to cache freed requests, and so can't overlap in time with req->inflight_entry. So, use inflight_entry to link requests and remove compl.list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	7255834ed6	io_uring: remove redundant args from cache_free We don't use @tsk argument of io_req_cache_free(), remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	c34b025f2d	io_uring: cache __io_free_req()'d requests Don't kfree requests in __io_free_req() but put them back into the internal request cache. That makes allocations more sustainable and will be used for refcounting optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:43 -06:00
Pavel Begunkov	f56165e62f	io_uring: move io_fallback_req_func() Move io_fallback_req_func() to kill yet another forward declaration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:09:22 -06:00
Pavel Begunkov	e9dbe221f5	io_uring: optimise putting task struct We cache all the reference to task + tctx, so if io_put_task() is called by the corresponding task itself, we can save on atomics and return the refs right back into the cache. It's beneficial for all inline completions, and also iopolling, when polling and submissions are done by the same task, including SQPOLL\|IOPOLL. Note: io_uring_cancel_generic() can return refs to the cache as well, so those should be flushed in the loop for tctx_inflight() to work right. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	af066f31eb	io_uring: drop exec checks from io_req_task_submit In case of on-exec io_uring cancellations, tasks already wait for all submitted requests to get completed/cancelled, so we don't need to check for ->in_execve separately. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	bbbca09489	io_uring: kill unused IO_IOPOLL_BATCH IO_IOPOLL_BATCH is not used, delete it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:08:06 -06:00
Pavel Begunkov	58d3be2c60	io_uring: improve ctx hang handling If io_ring_exit_work() can't get it done in 5 minutes, something is going very wrong, don't keep spinning at HZ / 20 rate, it doesn't help and it may take much of CPU time if there is a lot of workers stuck as such. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	d3fddf6ddd	io_uring: deduplicate open iopoll check Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat and openat2 reuse it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	543af3a13d	io_uring: inline io_free_req_deferred Inline io_free_req_deferred(), there is no reason to keep it separated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	b9bd2bea0f	io_uring: move io_rsrc_node_alloc() definition Move the function together with io_rsrc_node_ref_zero() in the source file as it is to get rid of forward declarations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	6a290a1442	io_uring: move io_put_task() definition Move the function in the source file as it is to get rid of forward declarations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	e73c5c7cd3	io_uring: extract a helper for ctx quiesce Refactor __io_uring_register() by extracting a helper responsible for ctx queisce. Looks better and will make it easier to add more optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	90291099f2	io_uring: optimise io_cqring_wait() hot path Turns out we always init struct io_wait_queue in io_cqring_wait(), even if it's not used after, i.e. there are already enough of CQEs. And often it's exactly what happens, for instance, requests may have been completed inline, or in case of io_uring_enter(submit=N, wait=1). It shows up in my profiler, so optimise it by delaying the struct init. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com [axboe: fixed up for new cqring wait] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	282cdc8693	io_uring: add more locking annotations for submit Add more annotations for submission path functions holding ->uring_lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	a2416e1ec2	io_uring: don't halt iopoll too early IOPOLL users should care more about getting completions for requests they submitted, but not in "device did/completed something". Currently, io_do_iopoll() may return a positive number, which will instruct io_iopoll_check() to break the loop and end the syscall, even if there is not enough CQEs or none at all. Don't return positive numbers, so io_iopoll_check() exits only when it gets an actual error, need reschedule or got enough CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	864ea921b0	io_uring: refactor io_alloc_req Replace the main if of io_flush_cached_reqs() with inverted condition + goto, so all the cases are handled in the same way. And also extract io_preinit_req() to make it cleaner and easier to refer to. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:59 -06:00
Pavel Begunkov	2215bed924	io_uring: remove unnecessary PF_EXITING check We prefer nornal task_works even if it would fail requests inside. Kill a PF_EXITING check in io_req_task_work_add(), task_work_add() handles well dying tasks, i.e. return error when can't enqueue due to late stages of do_exit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	ebc11b6c6b	io_uring: clean io-wq callbacks Move io-wq callbacks closer to each other, so it's easier to work with them, and rename io_free_work() into io_wq_free_work() for consistency. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	c97d8a0f68	io_uring: avoid touching inode in rw prep If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set. However, for non-reg files io_prep_rw() still will look into inode to double check, and that's expensive and can be avoided. The only caveat is that it only currently works with 64+ bit architectures, see FFS_ISREG, so we should consider that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	b191e2dfe5	io_uring: rename io_file_supports_async() io_file_supports_async() checks whether a file supports nowait operations, so "async" in the name is misleading. Rename it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	ac177053bb	io_uring: inline fixed part of io_file_get() Optimise io_file_get() with registered files, which is in a hot path, by inlining parts of the function. Saves a function call, and inefficiencies of passing arguments, e.g. evaluating (sqe_flags & IOSQE_FIXED_FILE). It couldn't have been done before as compilers were refusing to inline it because of the function size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Pavel Begunkov	042b0d85ea	io_uring: use kvmalloc for fixed files Instead of hand-coded two-level tables for registered files, allocate them with kvmalloc(). In many cases small enough tables are enough, and so can be kmalloc()'ed removing an extra memory load and a bunch of bit logic instructions from the hot path. If the table is larger, we trade off all the pros with a TLB-assisted memory lookup. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Jens Axboe	5fd4617840	io_uring: be smarter about waking multiple CQ ring waiters Currently we only wake the first waiter, even if we have enough entries posted to satisfy multiple waiters. Improve that situation so that every waiter knows how much the CQ tail has to advance before they can be safely woken up. With this change, if we have N waiters each asking for 1 event and we get 4 completions, then we wake up 4 waiters. If we have N waiters asking for 2 completions and we get 4 completions, then we wake up the first two. Previously, only the first waiter would've been woken up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-23 13:07:56 -06:00
Jens Axboe	a30f895ad3	io_uring: fix xa_alloc_cycle() error return value check We currently check for ret != 0 to indicate error, but '1' is a valid return and just indicates that the allocation succeeded with a wrap. Correct the check to be for < 0, like it was before the xarray conversion. Cc: stable@vger.kernel.org Fixes: `61cf93700f` ("io_uring: Convert personality_idr to XArray") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-20 14:59:58 -06:00
Pavel Begunkov	9cb0073b30	io_uring: pin ctx on fallback execution Pin ring in io_fallback_req_func() by briefly elevating ctx->refs in case any task_work handler touches ctx after releasing a request. Fixes: `9011bf9a13` ("io_uring: fix stuck fallback reqs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/833a494713d235ec144284a9bbfe418df4f6b61c.1629235576.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-17 16:06:14 -06:00
Jens Axboe	21f965221e	io_uring: only assign io_uring_enter() SQPOLL error in actual error case If an SQPOLL based ring is newly created and an application issues an io_uring_enter(2) system call on it, then we can return a spurious -EOWNERDEAD error. This happens because there's nothing to submit, and if the caller doesn't specify any other action, the initial error assignment of -EOWNERDEAD never gets overwritten. This causes us to return it directly, even if it isn't valid. Move the error assignment into the actual failure case instead. Cc: stable@vger.kernel.org Fixes: `d9d05217cb` ("io_uring: stop SQPOLL submit on creator's death") Reported-by: Sherlock Holo sherlockya@gmail.com Link: https://github.com/axboe/liburing/issues/413 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-14 12:38:21 -06:00
Pavel Begunkov	43597aac1f	io_uring: fix ctx-exit io_rsrc_put_work() deadlock __io_rsrc_put_work() might need ->uring_lock, so nobody should wait for rsrc nodes holding the mutex. However, that's exactly what io_ring_ctx_free() does with io_wait_rsrc_data(). Split it into rsrc wait + dealloc, and move the first one out of the lock. Cc: stable@vger.kernel.org Fixes: `b60c8dce33` ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 19:59:28 -06:00
Jens Axboe	c018db4a57	io_uring: drop ctx->uring_lock before flushing work item Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags from the regression suite: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G OE ------------------------------------------------------ kworker/2:4/2684 is trying to acquire lock: ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0 but task is already holding lock: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}: __flush_work+0x31b/0x490 io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0 __do_sys_io_uring_register+0x45b/0x1060 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 __mutex_lock+0x86/0x740 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 kthread+0x135/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); * DEADLOCK * 2 locks held by kworker/2:4/2684: #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 stack backtrace: CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G OE 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015 Workqueue: events io_rsrc_put_work Call Trace: dump_stack_lvl+0x6a/0x9a check_noncircular+0xfe/0x110 __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 ? io_rsrc_put_work+0x13d/0x1a0 __mutex_lock+0x86/0x740 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? process_one_work+0x1ce/0x530 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 ? process_one_work+0x530/0x530 kthread+0x135/0x160 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 which is due to holding the ctx->uring_lock when flushing existing pending work, while the pending work flushing may need to grab the uring lock if we're using IOPOLL. Fix this by dropping the uring_lock a bit earlier as part of the flush. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/404 Tested-by: Ammar Faizi <ammarfaizi2@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 19:59:06 -06:00
Jens Axboe	4956b9eaad	io_uring: rsrc ref lock needs to be IRQ safe Nadav reports running into the below splat on re-enabling softirqs: WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0 Modules linked in: CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 RIP: 0010:__local_bh_enable_ip+0xaa/0xe0 Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65 RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9 R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890 FS: 00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0 Call Trace: _raw_spin_unlock_bh+0x31/0x40 io_rsrc_node_ref_zero+0x13e/0x190 io_dismantle_req+0x215/0x220 io_req_complete_post+0x1b8/0x720 __io_complete_rw.isra.0+0x16b/0x1f0 io_complete_rw+0x10/0x20 where it's clear we end up calling the percpu count release directly from the completion path, as it's in atomic mode and we drop the last ref. For file/block IO, this can be from IRQ context already, and the softirq locking for rsrc isn't enough. Just make the lock fully IRQ safe, and ensure we correctly safe state from the release path as we don't know the full context there. Reported-by: Nadav Amit <nadav.amit@gmail.com> Tested-by: Nadav Amit <nadav.amit@gmail.com> Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-09 19:58:59 -06:00
Nadav Amit	20c0b380f9	io_uring: Use WRITE_ONCE() when writing to sq_flags The compiler should be forbidden from any strange optimization for async writes to user visible data-structures. Without proper protection, the compiler can cause write-tearing or invent writes that would confuse the userspace. However, there are writes to sq_flags which are not protected by WRITE_ONCE(). Use WRITE_ONCE() for these writes. This is purely a theoretical issue. Presumably, any compiler is very unlikely to do such optimizations. Fixes: `75b28affdd` ("io_uring: allocate the two rings together") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Nadav Amit <namit@vmware.com> Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-08 21:21:11 -06:00
Nadav Amit	ef98eb0409	io_uring: clear TIF_NOTIFY_SIGNAL when running task work When using SQPOLL, the submission queue polling thread calls task_work_run() to run queued work. However, when work is added with TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains set afterwards and is never cleared. Consequently, when the submission queue polling thread checks whether signal_pending(), it may always find a pending signal, if task_work_add() was ever called before. The impact of this bug might be different on different kernel versions. It appears that on 5.14 it would only cause unnecessary calculation and prevent the polling thread from sleeping. On 5.13, where the bug was found, it stops the polling thread from finding newly submitted work. Instead of task_work_run(), use tracehook_notify_signal() that clears TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to current->task_works to avoid a race in which task_works is cleared but the TIF_NOTIFY_SIGNAL is set. Fixes: `685fe7feed` ("io-wq: eliminate the need for a manager thread") Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Nadav Amit <namit@vmware.com> Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-08-08 21:21:11 -06:00
Hao Xu	a890d01e4e	io_uring: fix poll requests leaking second poll entries For pure poll requests, it doesn't remove the second poll wait entry when it's done, neither after vfs_poll() or in the poll completion handler. We should remove the second poll wait entry. And we use io_poll_remove_double() rather than io_poll_remove_waitqs() since the latter has some redundant logic. Fixes: `88e41cf928` ("io_uring: add multishot mode for IORING_OP_POLL_ADD") Cc: stable@vger.kernel.org # 5.13+ Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20210728030322.12307-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-28 07:24:57 -06:00
Jens Axboe	ef04688871	io_uring: don't block level reissue off completion path Some setups, like SCSI, can throw spurious -EAGAIN off the softirq completion path. Normally we expect this to happen inline as part of submission, but apparently SCSI has a weird corner case where it can happen as part of normal completions. This should be solved by having the -EAGAIN bubble back up the stack as part of submission, but previous attempts at this failed and we're not just quite there yet. Instead we currently use REQ_F_REISSUE to handle this case. For now, catch it in io_rw_should_reissue() and prevent a reissue from a bogus path. Cc: stable@vger.kernel.org Reported-by: Fabian Ebner <f.ebner@proxmox.com> Tested-by: Fabian Ebner <f.ebner@proxmox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-28 07:24:38 -06:00
Jens Axboe	773af69121	io_uring: always reissue from task_work context As a safeguard, if we're going to queue async work, do it from task_work from the original task. This ensures that we can always sanely create threads, regards of what the reissue context may be. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-27 10:49:48 -06:00
Jens Axboe	110aa25c3c	io_uring: fix race in unified task_work running We use a bit to manage if we need to add the shared task_work, but a list + lock for the pending work. Before aborting a current run of the task_work we check if the list is empty, but we do so without grabbing the lock that protects it. This can lead to races where we think we have nothing left to run, where in practice we could be racing with a task adding new work to the list. If we do hit that race condition, we could be left with work items that need processing, but the shared task_work is not active. Ensure that we grab the lock before checking if the list is empty, so we know if it's safe to exit the run or not. Link: https://lore.kernel.org/io-uring/c6bd5987-e9ae-cd02-49d0-1b3ac1ef65b1@tnonline.net/ Cc: stable@vger.kernel.org # 5.11+ Reported-by: Forza <forza@tnonline.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-26 10:42:56 -06:00
Pavel Begunkov	44eff40a32	io_uring: fix io_prep_async_link locking io_prep_async_link() may be called after arming a linked timeout, automatically making it unsafe to traverse the linked list. Guard with completion_lock if there was a linked timeout. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/93f7c617e2b4f012a2a175b3dab6bc2f27cebc48.1627304436.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-26 08:58:04 -06:00
Jens Axboe	991468dcf1	io_uring: explicitly catch any illegal async queue attempt Catch an illegal case to queue async from an unrelated task that got the ring fd passed to it. This should not be possible to hit, but better be proactive and catch it explicitly. io-wq is extended to check for early IO_WQ_WORK_CANCEL being set on a work item as well, so it can run the request through the normal cancelation path. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-23 16:44:51 -06:00
Jens Axboe	3c30ef0f78	io_uring: never attempt iopoll reissue from release path There are two reasons why this shouldn't be done: 1) Ring is exiting, and we're canceling requests anyway. Any request should be canceled anyway. In theory, this could iterate for a number of times if someone else is also driving the target block queue into request starvation, however the likelihood of this happening is miniscule. 2) If the original task decided to pass the ring to another task, then we don't want to be reissuing from this context as it may be an unrelated task or context. No assumptions should be made about the context in which ->release() is run. This can only happen for pure read/write, and we'll get -EFAULT on them anyway. Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-23 16:32:48 -06:00
Jens Axboe	0cc936f74b	io_uring: fix early fdput() of file A previous commit shuffled some code around, and inadvertently used struct file after fdput() had been called on it. As we can't touch the file post fdput() dropping our reference, move the fdput() to after that has been done. Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/YPnqM0fY3nM5RdRI@zeniv-ca.linux.org.uk/ Fixes: `f2a48dd09b` ("io_uring: refactor io_sq_offload_create()") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-22 17:11:46 -06:00
Yang Yingliang	362a9e6528	io_uring: fix memleak in io_init_wq_offload() I got memory leak report when doing fuzz test: BUG: memory leak unreferenced object 0xffff888107310a80 (size 96): comm "syz-executor.6", pid 4610, jiffies 4295140240 (age 20.135s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... backtrace: [<000000001974933b>] kmalloc include/linux/slab.h:591 [inline] [<000000001974933b>] kzalloc include/linux/slab.h:721 [inline] [<000000001974933b>] io_init_wq_offload fs/io_uring.c:7920 [inline] [<000000001974933b>] io_uring_alloc_task_context+0x466/0x640 fs/io_uring.c:7955 [<0000000039d0800d>] __io_uring_add_tctx_node+0x256/0x360 fs/io_uring.c:9016 [<000000008482e78c>] io_uring_add_tctx_node fs/io_uring.c:9052 [inline] [<000000008482e78c>] __do_sys_io_uring_enter fs/io_uring.c:9354 [inline] [<000000008482e78c>] __se_sys_io_uring_enter fs/io_uring.c:9301 [inline] [<000000008482e78c>] __x64_sys_io_uring_enter+0xabc/0xc20 fs/io_uring.c:9301 [<00000000b875f18f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<00000000b875f18f>] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 [<000000006b0a8484>] entry_SYSCALL_64_after_hwframe+0x44/0xae CPU0 CPU1 io_uring_enter io_uring_enter io_uring_add_tctx_node io_uring_add_tctx_node __io_uring_add_tctx_node __io_uring_add_tctx_node io_uring_alloc_task_context io_uring_alloc_task_context io_init_wq_offload io_init_wq_offload hash = kzalloc hash = kzalloc ctx->hash_map = hash ctx->hash_map = hash <- one of the hash is leaked When calling io_uring_enter() in parallel, the 'hash_map' will be leaked, add uring_lock to protect 'hash_map'. Fixes: `e941894eae` ("io-wq: make buffered file write hashed work map per-ctx") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210720083805.3030730-1-yangyingliang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-20 07:51:47 -06:00
Pavel Begunkov	46fee9ab02	io_uring: remove double poll entry on arm failure __io_queue_proc() can enqueue both poll entries and still fail afterwards, so the callers trying to cancel it should also try to remove the second poll entry (if any). For example, it may leave the request alive referencing a io_uring context but not accessible for cancellation: [ 282.599913][ T1620] task:iou-sqp-23145 state:D stack:28720 pid:23155 ppid: 8844 flags:0x00004004 [ 282.609927][ T1620] Call Trace: [ 282.613711][ T1620] __schedule+0x93a/0x26f0 [ 282.634647][ T1620] schedule+0xd3/0x270 [ 282.638874][ T1620] io_uring_cancel_generic+0x54d/0x890 [ 282.660346][ T1620] io_sq_thread+0xaac/0x1250 [ 282.696394][ T1620] ret_from_fork+0x1f/0x30 Cc: stable@vger.kernel.org Fixes: `18bceab101` ("io_uring: allow POLL_ADD with double poll_wait() users") Reported-and-tested-by: syzbot+ac957324022b7132accf@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-20 07:50:42 -06:00
Pavel Begunkov	68b11e8b15	io_uring: explicitly count entries for poll reqs If __io_queue_proc() fails to add a second poll entry, e.g. kmalloc() failed, but it goes on with a third waitqueue, it may succeed and overwrite the error status. Count the number of poll entries we added, so we can set pt->error to zero at the beginning and find out when the mentioned scenario happens. Cc: stable@vger.kernel.org Fixes: `18bceab101` ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-20 07:50:42 -06:00
Pavel Begunkov	1b48773f9f	io_uring: fix io_drain_req() io_drain_req() return whether the request has been consumed or not, not an error code. Fix a stupid mistake slipped from optimisation patches. Reported-by: syzbot+ba6fcd859210f4e9e109@syzkaller.appspotmail.com Fixes: `76cc33d791` ("io_uring: refactor io_req_defer()") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d3c53c4274ffff307c8ae062fc7fda63b978df2.1626039606.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-11 16:39:06 -06:00
Pavel Begunkov	9c6882608b	io_uring: use right task for exiting checks When we use delayed_work for fallback execution of requests, current will be not of the submitter task, and so checks in io_req_task_submit() may not behave as expected. Currently, it leaves inline completions not flushed, so making io_ring_exit_work() to hang. Use the submitter task for all those checks. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-11 16:39:06 -06:00
Jens Axboe	9ce85ef2cb	io_uring: remove dead non-zero 'poll' check Colin reports that Coverity complains about checking for poll being non-zero after having dereferenced it multiple times. This is a valid complaint, and actually a leftover from back when this code was based on the aio poll code. Kill the redundant check. Link: https://lore.kernel.org/io-uring/fe70c532-e2a7-3722-58a1-0fa4e5c5ff2c@canonical.com/ Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-09 08:20:28 -06:00
Pavel Begunkov	8f487ef2cb	io_uring: mitigate unlikely iopoll lag We have requests like IORING_OP_FILES_UPDATE that don't go through ->iopoll_list but get completed in place under ->uring_lock, and so after dropping the lock io_iopoll_check() should expect that some CQEs might have get completed in a meanwhile. Currently such events won't be accounted in @nr_events, and the loop will continue to poll even if there is enough of CQEs. It shouldn't be a problem as it's not likely to happen and so, but not nice either. Just return earlier in this case, it should be enough. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66ef932cc66a34e3771bbae04b2953a8058e9d05.1625747741.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-08 14:07:43 -06:00
Pavel Begunkov	c32aace0cf	io_uring: fix drain alloc fail return code After a recent change io_drain_req() started to fail requests with result=0 in case of allocation failure, where it should be and have been -ENOMEM. Fixes: `76cc33d791` ("io_uring: refactor io_req_defer()") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e068110ac4293e0c56cfc4d280d0f22b9303ec08.1625682153.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-07 12:49:32 -06:00
Pavel Begunkov	e09ee51060	io_uring: fix exiting io_req_task_work_add leaks If one entered io_req_task_work_add() not seeing PF_EXITING, it will set a ->task_state bit and try task_work_add(), which may fail by that moment. If that happens the function would try to cancel the request. However, in a meanwhile there might come other io_req_task_work_add() callers, which will see the bit set and leave their requests in the list, which will never be executed. Don't propagate an error, but clear the bit first and then fallback all requests that we can splice from the list. The callback functions have to be able to deal with PF_EXITING, so poll and apoll was modified via changing io_poll_rewait(). Fixes: `7cbf1722d5` ("io_uring: provide FIFO ordering for task_work") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/060002f19f1fdbd130ba24aef818ea4d3080819b.1625142209.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-01 13:40:32 -06:00
Pavel Begunkov	5b0a6acc73	io_uring: simplify task_work func Since we don't really use req->task_work anymore, get rid of it together with the nasty ->func aliasing between ->io_task_work and ->task_work, and hide ->fallback_node inside of io_task_work. Also, as task_work is gone now, replace the callback type from task_work_func_t to a function taking io_kiocb to avoid casting and simplify code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-07-01 13:40:23 -06:00

1 2 3 4 5 ...

1607 Commits