OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jens Axboe	46568e9be7	io_uring: fix error clear of ->file_table in io_sqe_files_register() syzbot reports that when using failslab and friends, we can get a double free in io_sqe_files_unregister(): BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468 __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450 kasan_slab_free+0xe/0x10 mm/kasan/common.c:480 __cache_free mm/slab.c:3426 [inline] kfree+0x10a/0x2c0 mm/slab.c:3757 io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 io_ring_ctx_free fs/io_uring.c:3998 [inline] io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060 io_uring_release+0x42/0x50 fs/io_uring.c:4068 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x904/0x2e60 kernel/exit.c:817 do_group_exit+0x135/0x360 kernel/exit.c:921 __do_sys_exit_group kernel/exit.c:932 [inline] __se_sys_exit_group kernel/exit.c:930 [inline] __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x43f2c8 Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b 6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00 00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48 RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8 RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000 This happens if we fail allocating the file tables. For that case we do free the file table correctly, but we forget to set it to NULL. This means that ring teardown will see it as being non-NULL, and attempt to free it again. Fix this by clearing the file_table pointer if we free the table. Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com Fixes: `65e19f54d2` ("io_uring: support for larger fixed file sets") Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jackie Liu	c69f8dbe24	io_uring: separate the io_free_req and io_free_req_find_next interface Similar to the distinction between io_put_req and io_put_req_find_next, io_free_req has been modified similarly, with no functional changes. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jackie Liu	ec9c02ad4c	io_uring: keep io_put_req only responsible for release and put req We already have io_put_req_find_next to find the next req of the link. we should not use the io_put_req function to find them. They should be functions of the same level. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jackie Liu	a197f664a0	io_uring: remove passed in 'ctx' function parameter ctx if possible Many times, the core of the function is req, and req has already set req->ctx at initialization time, so there is no need to pass in the ctx from the caller. Cleanup, no functional change. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jens Axboe	206aefde4f	io_uring: reduce/pack size of io_ring_ctx With the recent flurry of additions and changes to io_uring, the layout of io_ring_ctx has become a bit stale. We're right now at 704 bytes in size on my x86-64 build, or 11 cachelines. This patch does two things: - We have to completion structs embedded, that we only use for quiesce of the ctx (or shutdown) and for sqthread init cases. That 2x32 bytes right there, let's dynamically allocate them. - Reorder the struct a bit with an eye on cachelines, use cases, and holes. With this patch, we're down to 512 bytes, or 8 cachelines. Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-10 20:29:49 -07:00
Jens Axboe	5f8fd2d3e0	io_uring: properly mark async work as bounded vs unbounded Now that io-wq supports separating the two request lifetime types, mark the following IO as having unbounded runtimes: - Any read/write to a non-regular file - Any specific networked IO - Any poll command Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 11:57:17 -07:00
Jens Axboe	c5def4ab84	io-wq: add support for bounded vs unbunded work io_uring supports request types that basically have two different lifetimes: 1) Bounded completion time. These are requests like disk reads or writes, which we know will finish in a finite amount of time. 2) Unbounded completion time. These are generally networked IO, where we have no idea how long they will take to complete. Another example is POLL commands. This patch provides support for io-wq to handle these differently, so we don't starve bounded requests by tying up workers for too long. By default all work is bounded, unless otherwise specified in the work item. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 11:41:35 -07:00
Jens Axboe	91d666ea43	io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful() We hold the wqe lock at this point (which is also annotated), so there's no need to use the careful variant of list_empty(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-09 11:45:32 -07:00
Jens Axboe	1d7bb1d50f	io_uring: add support for backlogged CQ ring Currently we drop completion events, if the CQ ring is full. That's fine for requests with bounded completion times, but it may make it harder or impossible to use io_uring with networked IO where request completion times are generally unbounded. Or with POLL, for example, which is also unbounded. After this patch, we never overflow the ring, we simply store requests in a backlog for later flushing. This flushing is done automatically by the kernel. To prevent the backlog from growing indefinitely, if the backlog is non-empty, we apply back pressure on IO submissions. Any attempt to submit new IO with a non-empty backlog will get an -EBUSY return from the kernel. This is a signal to the application that it has backlogged CQ events, and that it must reap those before being allowed to submit more IO. Note that if we do return -EBUSY, we will have filled whatever backlogged events into the CQ ring first, if there's room. This means the application can safely reap events WITHOUT entering the kernel and waiting for them, they are already available in the CQ ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-09 11:45:29 -07:00
Jens Axboe	78e19bbef3	io_uring: pass in io_kiocb to fill/add CQ handlers This is in preparation for handling CQ ring overflow a bit smarter. We should not have any functional changes in this patch. Most of the changes are fairly straight forward, the only ones that stick out a bit are the ones that change __io_free_req() to take the reference count into account. If the request hasn't been submitted yet, we know it's safe to simply ignore references and free it. But let's clean these up too, as later patches will depend on the caller doing the right thing if the completion logging grabs a reference to the request. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-08 06:57:27 -07:00
Jens Axboe	84f97dc233	io_uring: make io_cqring_events() take 'ctx' as argument The rings can be derived from the ctx, and we need the ctx there for a future change. No functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-08 06:57:21 -07:00
Jens Axboe	2665abfd75	io_uring: add support for linked SQE timeouts While we have support for generic timeouts, we don't have a way to tie a timeout to a specific SQE. The generic timeouts simply trigger wakeups on the CQ ring. This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid as a link to a previous command. The timeout specific can be either relative or absolute, following the same rules as IORING_OP_TIMEOUT. If the timeout triggers before the dependent command completes, it will attempt to cancel that command. Likewise, if the dependent command completes before the timeout triggers, it will cancel the timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 19:12:40 -07:00
Jens Axboe	e977d6d34f	io_uring: abstract out io_async_cancel_one() helper We're going to need this helper in a future patch, so move it out of io_async_cancel() and into its own separate function. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 12:31:31 -07:00
Pavel Begunkov	267bc90442	io_uring: use inlined struct sqe_submit req->submit is always up-to-date, use it directly Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 20:23:02 -07:00
Pavel Begunkov	50585b9a07	io_uring: Use submit info inlined into req Stack allocated struct sqe_submit is passed down to the submission path along with a request (a.k.a. struct io_kiocb), and will be copied into req->submit for async requests. As space for it is already allocated, fill req->submit in the first place instead of using on-stack one. As a result: 1. sqe->submit is the only place for sqe_submit and is always valid, so we don't need to track which one to use. 2. don't need to copy in case of async 3. allows to simplify the code by not carrying it as an argument all the way down 4. allows to reduce number of function arguments / potentially improve spilling The downside is that stack is most probably be cached, that's not true for just allocated memory for a request. Another concern is cache pollution. Though, a request would be touched and fetched along with req->submit at some point anyway, so shouldn't be a problem. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 20:23:00 -07:00
Pavel Begunkov	196be95cd5	io_uring: allocate io_kiocb upfront Let io_submit_sqes() to allocate io_kiocb before fetching an sqe. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 20:21:37 -07:00
Pavel Begunkov	e5eb6366ac	io_uring: io_queue_link*() right after submit After a call to io_submit_sqe(), it's already known whether it needs to queue a link or not. Do it there, as it's simplier and doesn't keep an extra variable across the loop. Reviewed-by：Bob Liu <bob.liu@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 11:20:11 -07:00
Pavel Begunkov	ae9428ca61	io_uring: Merge io_submit_sqes and io_ring_submit io_submit_sqes() and io_ring_submit() are doing the same stuff with a little difference. Deduplicate them. Reviewed-by：Bob Liu <bob.liu@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-06 11:20:07 -07:00
Jens Axboe	3aa5fa0305	io_uring: kill dead REQ_F_LINK_DONE flag We had no more use for this flag after the conversion to io-wq, kill it off. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 20:34:32 -07:00
Jens Axboe	f1f40853c0	io_uring: fixup a few spots where link failure isn't flagged If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if REQ_F_LINK is set. Any failure in the chain should break the chain. We were missing a few spots where this should be done. It might be nice to generalize this somewhat at some point, as long as we factor in the fact that failure looks different for each request type. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 20:33:16 -07:00
Jens Axboe	89723d0bd6	io_uring: enable optimized link handling for IORING_OP_POLL_ADD As introduced by commit: `ba816ad61f` ("io_uring: run dependent links inline if possible") enable inline dependent link running for poll commands. io_poll_complete_work() is the most important change, as it allows a linked sequence of { POLL, READ } (for example) to proceed inline instead of needing to get punted to another async context. The submission side only potentially matters for sqthread, but may as well include that bit. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 15:32:58 -07:00
Jens Axboe	6f72653e76	io-wq: use proper nesting IRQ disabling spinlocks for cancel We don't know what context we'll be called in for cancel, it could very well be with IRQs disabled already. Use the IRQ saving variants of the locking primitives. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-05 13:53:53 -07:00
Jens Axboe	1056ef9403	MAINTAINERS: update io_uring entry We now have a list that's appropriate for both kernel and userspace discussions on io_uring usage and development, add that to the MAINTAINERS entry. Also add the io-wq files. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 08:50:02 -07:00
Jens Axboe	51c3ff62ca	io_uring: add completion trace event We currently don't have a completion event trace, add one of those. And to better be able to match up submissions and completions, add user_data to the submission trace as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 07:07:52 -07:00
YueHaibing	364b05fd06	io-wq: use kfree_rcu() to simplify the code The callback function of call_rcu() just calls kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-02 07:59:46 -06:00
Jens Axboe	0069fc6b1c	io_uring: remove io_uring_add_to_prev() trace event This internal logic was killed with the conversion to io-wq, so we no longer have a need for this particular trace. Kill it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-01 12:44:40 -06:00
Jackie Liu	e9ffa5c2b7	io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait We didn't use -ERESTARTSYS to tell the application layer to restart the system call, but instead return -EINTR. we can set -EINTR directly when wakeup by the signal, which can help us save an assignment operation and comparison operation. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-01 08:36:36 -06:00
Jens Axboe	62755e35df	io_uring: support for generic async request cancel This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-01 08:35:31 -06:00
Jens Axboe	975c99a570	io_uring: io_wq_create() returns an error pointer, not NULL syzbot reported an issue where we crash at setup time if failslab is used. The issue is that io_wq_create() returns an error pointer on failure, not NULL. Hence io_uring thought the io-wq was setup just fine, but in reality it's a garbage error pointer. Use IS_ERR() instead of a NULL check, and assign ret appropriately. Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-30 08:42:56 -06:00
Jens Axboe	842f96124c	io_uring: fix race with canceling timeouts If we get -1 from hrtimer_try_to_cancel(), we know that the timer is running. Hence leave all completion to the timeout handler. If we don't, we can corrupt the list and miss a completion. Fixes: `11365043e5` ("io_uring: add support for canceling timeout requests") Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 15:43:30 -06:00
Jens Axboe	65e19f54d2	io_uring: support for larger fixed file sets There's been a few requests for supporting more fixed files than 1024. This isn't really tricky to do, we just need to split up the file table into multiple tables and index appropriately. As we do so, reduce the max single file table to 512. This enables us to do single page allocs always for the tables, which is an improvement over the situation prior. This patch adds support for up to 64K files, which should be enough for everyone. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	b7620121dc	io_uring: protect fixed file indexing with array_index_nospec() We index the file tables with a user given value. After we check it's within our limits, use array_index_nospec() to prevent any spectre attacks here. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	17f2fe35d0	io_uring: add support for IORING_OP_ACCEPT This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	de2ea4b64b	net: add __sys_accept4_file() helper This is identical to __sys_accept4(), except it takes a struct file instead of an fd, and it also allows passing in extra file->f_flags flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags. No functional changes in this patch. Cc: netdev@vger.kernel.org Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	fcb323cc53	io_uring: io_uring: add support for async work inheriting files This is in preparation for adding opcodes that need to add new files in a process file table, system calls like open(2) or accept4(2). If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work item. If work that needs to get punted to async context have this set, the async worker will assume the original task file table before executing the work. Note that opcodes that need access to the current files of an application cannot be done through IORING_SETUP_SQPOLL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	561fb04a6a	io_uring: replace workqueue usage with io-wq Drop various work-arounds we have for workqueues: - We no longer need the async_list for tracking sequential IO. - We don't have to maintain our own mm tracking/setting. - We don't need a separate workqueue for buffered writes. This didn't even work that well to begin with, as it was suboptimal for multiple buffered writers on multiple files. - We can properly cancel pending interruptible work. This fixes deadlocks with particularly socket IO, where we cannot cancel them when the io_uring is closed. Hence the ring will wait forever for these requests to complete, which may never happen. This is different from disk IO where we know requests will complete in a finite amount of time. - Due to being able to cancel work interruptible work that is already running, we can implement file table support for work. We need that for supporting system calls that add to a process file table. - It gets us one step closer to adding async support for any system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	771b53d033	io-wq: small threadpool implementation for io_uring This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are: - We can assign memory context smarter and more persistently if we manage the life time of threads. - We can drop various work-arounds we have in io_uring, like the async_list. - We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers. - We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets. - We need the above cancel for being able to assign and use file tables from a process. - We can implement a more thorough cancel operation in general. - We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads. This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node. io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:00 -06:00
Pavel Begunkov	95a1b3ff9a	io_uring: Fix mm_fault with READ/WRITE_FIXED Commit `fb5ccc9878` ("io_uring: Fix broken links with offloading") introduced a potential performance regression with unconditionally taking mm even for READ/WRITE_FIXED operations. Return the logic handling it back. mm-faulted requests will go through the generic submission path, so honoring links and drains, but will fail further on req->has_user check. Fixes: `fb5ccc9878` ("io_uring: Fix broken links with offloading") Cc: stable@vger.kernel.org # v5.4 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:24:23 -06:00
Pavel Begunkov	fa45622808	io_uring: remove index from sqe_submit submit->index is used only for inbound check in submission path (i.e. head < ctx->sq_entries). However, it always will be true, as 1. it's already validated by io_get_sqring() 2. ctx->sq_entries can't be changedd in between, because of held ctx->uring_lock and ctx->refs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:24:21 -06:00
Dmitrii Dolgov	c826bd7a74	io_uring: add set of tracing events To trace io_uring activity one can get an information from workqueue and io trace events, but looks like some parts could be hard to identify via this approach. Making what happens inside io_uring more transparent is important to be able to reason about many aspects of it, hence introduce the set of tracing events. All such events could be roughly divided into two categories: * those, that are helping to understand correctness (from both kernel and an application point of view). E.g. a ring creation, file registration, or waiting for available CQE. Proposed approach is to get a pointer to an original structure of interest (ring context, or request), and then find relevant events. io_uring_queue_async_work also exposes a pointer to work_struct, to be able to track down corresponding workqueue events. * those, that provide performance related information. Mostly it's about events that change the flow of requests, e.g. whether an async work was queued, or delayed due to some dependencies. Another important case is how io_uring optimizations (e.g. registered files) are utilized. Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:24:18 -06:00
Jens Axboe	11365043e5	io_uring: add support for canceling timeout requests We might have cases where the need for a specific timeout is gone, add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:50 -06:00
Jens Axboe	a41525ab2e	io_uring: add support for absolute timeouts This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:48 -06:00
Jackie Liu	ba5290ccb6	io_uring: replace s->needs_lock with s->in_async There is no function change, just to clean up the code, use s->in_async to make the code know where it is. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:47 -06:00
Jens Axboe	33a107f0a1	io_uring: allow application controlled CQ ring size We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE life time is different than that of the IO request itself, the SQE is consumed as soon as the kernel has seen the entry. Certain application don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently. If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:46 -06:00
Jens Axboe	c3a31e6056	io_uring: add support for IORING_REGISTER_FILES_UPDATE Allows the application to remove/replace/add files to/from a file set. Passes in a struct: struct io_uring_files_update { __u32 offset; __s32 *fds; }; that holds an array of fds, size of array passed in through the usual nr_args part of the io_uring_register() system call. The logic is as follows: 1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set. 2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i]. For case #2, is the existing file is currently empty (fd == -1), the new fd is simply added to the array. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:44 -06:00
Jens Axboe	08a451739a	io_uring: allow sparse fixed file sets This is in preparation for allowing updates to fixed file sets without requiring a full unregister+register. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:43 -06:00
Jens Axboe	ba816ad61f	io_uring: run dependent links inline if possible Currently any dependent link is executed from a new workqueue context, which means that we'll be doing a context switch per link in the chain. If we are running the completion of the current request from our async workqueue and find that the next request is a link, then run it directly from the workqueue context instead of forcing another switch. This improves the performance of linked SQEs, and reduces the CPU overhead. Reviewed-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:41 -06:00
Anton Ivanov	d848074b2f	um-ubd: Entrust re-queue to the upper layers Fixes crashes due to ubd requeue logic conflicting with the block-mq logic. Crash is reproducible in 5.0 - 5.3. Fixes: `53766defb8` ("um: Clean-up command processing in UML UBD driver") Cc: stable@vger.kernel.org # v5.0+ Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:07:41 -06:00
Anton Eidelman	86cccfbf77	nvme-multipath: remove unused groups_only mode in ana log groups_only mode in nvme_read_ana_log() is no longer used: remove it. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 08:55:00 -06:00
Anton Eidelman	af8fd04247	nvme-multipath: fix possible io hang after ctrl reconnect The following scenario results in an IO hang: 1) ctrl completes a request with NVME_SC_ANA_TRANSITION. NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered. 2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl. This fails because ctrl disconnects. Therefore nvme_update_ns_ana_state() is not called and NVME_NS_ANA_PENDING bit in ns->flags is not cleared. 3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls nvme_read_ana_log(ctrl, groups_only=true). However, nvme_update_ana_state() does not update namespaces because nr_nsids = 0 (due to groups_only mode). 4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK. Result: The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set. Consequently ctrl will never be considered a viable path by __nvme_find_path(). IO will hang if ctrl is the only or the last path to the namespace. More generally, while ctrl is reconnecting, its ANA state may change. And because nvme_mpath_init() requests ANA log in groups_only mode, these changes are not propagated to the existing ctrl namespaces. This may result in a mal-function or an IO hang. Solution: nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false. This will not harm the new ctrl case (no namespaces present), and will make sure the ANA state of namespaces gets updated after reconnect. Note: Another option would be for nvme_mpath_init() to invoke nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 08:55:00 -06:00

1 2 3 4 5 ...

872390 Commits All Branches Search

872390 Commits

All Branches