linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Pavel Begunkov	f2a8d5c7a2	io_uring: add tee(2) support Add IORING_OP_TEE implementing tee(2) support. Almost identical to splice bits, but without offsets. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-17 14:10:07 -06:00
Stefano Garzarella	7e55a19cf6	io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags This new flag should be set/clear from the application to disable/enable eventfd notifications when a request is completed and queued to the CQ ring. Before this patch, notifications were always sent if an eventfd is registered, so IORING_CQ_EVENTFD_DISABLED is not set during the initialization. It will be up to the application to set the flag after initialization if no notifications are required at the beginning. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-15 12:16:59 -06:00
Stefano Garzarella	0d9b5b3af1	io_uring: add 'cq_flags' field for the CQ ring This patch adds the new 'cq_flags' field that should be written by the application and read by the kernel. This new field is available to the userspace application through 'cq_off.flags'. We are using 4-bytes previously reserved and set to zero. This means that if the application finds this field to zero, then the new functionality is not supported. In the next patch we will introduce the first flag available. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-15 12:16:59 -06:00
Lukas Bulwahn	9f5834c868	io_uring: make spdxcheck.py happy Commit `bbbdeb4720` ("io_uring: dual license io_uring.h uapi header") uses a nested SPDX-License-Identifier to dual license the header. Since then, ./scripts/spdxcheck.py complains: include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR Add parentheses to make spdxcheck.py happy. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-21 14:03:46 -06:00
Jens Axboe	bbbdeb4720	io_uring: dual license io_uring.h uapi header This just syncs the header it with the liburing version, so there's no confusion on the license of the header parts. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-11 07:45:46 -06:00
Jens Axboe	067524e914	io_uring: provide means of removing buffers We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 09:12:56 -06:00
Jens Axboe	bcda7baaa3	io_uring: support buffer selection for OP_READ and OP_RECV If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd. Now that we have fast (non-async thread) support, a task can have tons of pending reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is unfeasible. With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as: (buffer_id << IORING_CQE_BUFFER_SHIFT) \| IORING_CQE_F_BUFFER; Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS. Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission, a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 09:12:45 -06:00
Jens Axboe	ddf0322db7	io_uring: add IORING_OP_PROVIDE_BUFFERS IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and lookup the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and length is each buffer length. A number of buffers to add with can be specified, in which case addr is incremented by length for each addition, and each buffer increments the buffer ID specified. No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer ID was used. The only restriction is that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the maximum ID that can be used. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 09:12:14 -06:00
Jens Axboe	d7718a9d25	io_uring: use poll driven retry for files that support it Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it. This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete. The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:06:38 -07:00
Pavel Begunkov	7d67af2c01	io_uring: add splice(2) support Add support for splice(2). - output file is specified as sqe->fd, so it's handled by generic code - hash_reg_file handled by generic code as well - len is 32bit, but should be fine - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 14:04:37 -07:00
Jens Axboe	3e4827b05d	io_uring: add support for epoll_ctl(2) This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-29 15:46:09 -07:00
Jens Axboe	75c6a03904	io_uring: support using a registered personality for commands For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done through setting sqe->personality to the id returned from registration, and then flagging sqe->flags with IOSQE_PERSONALITY. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:45:31 -07:00
Jens Axboe	071698e13a	io_uring: allow registering credentials If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't lookup credentials, the credentials of the task calling IORING_REGISTER_PERSONALITY is used. An 'id' is returned for the application to use in subsequent personality support. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:44:44 -07:00
Pavel Begunkov	24369c2e3b	io_uring: add io-wq workqueue sharing If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd io-wq of which will be shared with the newly created io_uring instance. If the flag is set but it can't share io-wq, it fails. This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private, but want to share the async backend to minimize the amount of overhead associated with having multiple rings that belong to the same backend. Reported-by: Jens Axboe <axboe@kernel.dk> Reported-by: Daurnimator <quae@daurnimator.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:44:41 -07:00
Jens Axboe	cccf0ee834	io_uring/io-wq: don't use static creds/mm assignments We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-28 17:44:20 -07:00
Pavel Begunkov	6b47ee6eca	io_uring: optimise sqe-to-req flags translation For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG) req->flags \|= REQ_F_FLAG Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:07 -07:00
Jens Axboe	66f4af93da	io_uring: add support for probing opcodes The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what it supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:06 -07:00
Jens Axboe	cebdb98617	io_uring: add support for IORING_OP_OPENAT2 Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	f2842ab5b7	io_uring: enable option to only trigger eventfd for async completions If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows that there are events available, and don't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <markuspapadakis@icloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	fddafacee2	io_uring: add support for send(2) and recv(2) This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:04 -07:00
Jens Axboe	8110c1a621	io_uring: add support for IORING_SETUP_CLAMP Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	c1ca757bd6	io_uring: add IORING_OP_MADVISE This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:02 -07:00
Jens Axboe	4840e418c2	io_uring: add IORING_OP_FADVISE This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:04:01 -07:00
Jens Axboe	ba04291eb6	io_uring: allow use of offset == -1 to mean file position This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	3a6820f2bb	io_uring: add non-vectored read/write commands For uses cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particular true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	ce35a47a3a	io_uring: add IOSQE_ASYNC io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:59 -07:00
Jens Axboe	eddc7ef52a	io_uring: add support for IORING_OP_STATX This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:54 -07:00
Jens Axboe	05f3fb3c53	io_uring: avoid ring quiesce for fixed file set unregister and update We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:03:50 -07:00
Jens Axboe	b5dba59e0c	io_uring: add support for IORING_OP_CLOSE This works just like close(2), unsurprisingly. We remove the file descriptor and post the completion inline, then offload the actual (potential) last file put to async context. Mark the async part of this work as uncancellable, as we really must guarantee that the latter part of the close is run. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	15b71abe7b	io_uring: add support for IORING_OP_OPENAT This works just like openat(2), except it can be performed async. For the normal case of a non-blocking path lookup this will complete inline. If we have to do IO to perform the open, it'll be done from async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Jens Axboe	d63d1b5edb	io_uring: add support for fallocate() This exposes fallocate(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:01:53 -07:00
Eugene Syromiatnikov	1292e972ff	io_uring: fix compat for IORING_REGISTER_FILES_UPDATE fds field of struct io_uring_files_update is problematic with regards to compat user space, as pointer size is different in 32-bit, 32-on-64-bit, and 64-bit user space. In order to avoid custom handling of compat in the syscall implementation, make fds __u64 and use u64_to_user_ptr in order to retrieve it. Also, align the field naturally and check that no garbage is passed there. Fixes: `c3a31e6056` ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-20 17:00:44 -07:00
Jens Axboe	9e3aa61ae3	io_uring: ensure we return -EINVAL on unknown opcode If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-11 16:02:32 -07:00
Jens Axboe	4e88d6e779	io_uring: allow unbreakable links Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-10 16:33:06 -07:00
Jens Axboe	da8c969069	io_uring: mark us with IORING_FEAT_SUBMIT_STABLE If this flag is set, applications can be certain that any data for async offload has been consumed when the kernel has consumed the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-03 07:04:32 -07:00
Jens Axboe	f8e85cf255	io_uring: add support for IORING_OP_CONNECT This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to. Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:11 -07:00
Jens Axboe	1d7bb1d50f	io_uring: add support for backlogged CQ ring Currently we drop completion events, if the CQ ring is full. That's fine for requests with bounded completion times, but it may make it harder or impossible to use io_uring with networked IO where request completion times are generally unbounded. Or with POLL, for example, which is also unbounded. After this patch, we never overflow the ring, we simply store requests in a backlog for later flushing. This flushing is done automatically by the kernel. To prevent the backlog from growing indefinitely, if the backlog is non-empty, we apply back pressure on IO submissions. Any attempt to submit new IO with a non-empty backlog will get an -EBUSY return from the kernel. This is a signal to the application that it has backlogged CQ events, and that it must reap those before being allowed to submit more IO. Note that if we do return -EBUSY, we will have filled whatever backlogged events into the CQ ring first, if there's room. This means the application can safely reap events WITHOUT entering the kernel and waiting for them, they are already available in the CQ ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-09 11:45:29 -07:00
Jens Axboe	2665abfd75	io_uring: add support for linked SQE timeouts While we have support for generic timeouts, we don't have a way to tie a timeout to a specific SQE. The generic timeouts simply trigger wakeups on the CQ ring. This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid as a link to a previous command. The timeout specific can be either relative or absolute, following the same rules as IORING_OP_TIMEOUT. If the timeout triggers before the dependent command completes, it will attempt to cancel that command. Likewise, if the dependent command completes before the timeout triggers, it will cancel the timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-07 19:12:40 -07:00
Jens Axboe	62755e35df	io_uring: support for generic async request cancel This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-01 08:35:31 -06:00
Jens Axboe	17f2fe35d0	io_uring: add support for IORING_OP_ACCEPT This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 12:43:06 -06:00
Jens Axboe	11365043e5	io_uring: add support for canceling timeout requests We might have cases where the need for a specific timeout is gone, add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:50 -06:00
Jens Axboe	a41525ab2e	io_uring: add support for absolute timeouts This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:48 -06:00
Jens Axboe	33a107f0a1	io_uring: allow application controlled CQ ring size We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE life time is different than that of the IO request itself, the SQE is consumed as soon as the kernel has seen the entry. Certain application don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently. If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:46 -06:00
Jens Axboe	c3a31e6056	io_uring: add support for IORING_REGISTER_FILES_UPDATE Allows the application to remove/replace/add files to/from a file set. Passes in a struct: struct io_uring_files_update { __u32 offset; __s32 *fds; }; that holds an array of fds, size of array passed in through the usual nr_args part of the io_uring_register() system call. The logic is as follows: 1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set. 2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i]. For case #2, is the existing file is currently empty (fd == -1), the new fd is simply added to the array. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 10:22:44 -06:00
Jens Axboe	5262f56798	io_uring: IORING_OP_TIMEOUT support There's been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid. This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec. Timeouts are relative. The timeout command also includes a way to auto-cancel after N events has passed. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-18 10:43:22 -06:00
Jens Axboe	ac90f249e1	io_uring: expose single mmap capability After commit `75b28affdd` we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that. Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-06 10:26:21 -06:00
Jens Axboe	aa1fa28fc7	io_uring: add support for recvmsg() This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well. We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-09 14:32:14 -06:00
Jens Axboe	0fa03c624d	io_uring: add support for sendmsg() This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field. We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-09 14:32:05 -06:00
Jens Axboe	9e645e1105	io_uring: add support for sqe links With SQE links, we can create chains of dependent SQEs. One example would be queueing an SQE that's a read from one file descriptor, with the linked SQE being a write to another with the same set of buffers. An SQE link will not stall the pipeline, it'll just ensure that dependent SQEs aren't issued before the previous link has completed. Any error at submission or completion time will break the chain of SQEs. For completions, this also includes short reads or writes, as the next SQE could depend on the previous one being fully completed. Any SQE in a chain that gets canceled due to any of the above errors, will get an CQE fill with -ECANCELED as the error value. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-06-24 08:00:18 -06:00
Jens Axboe	9b402849e8	io_uring: add support for eventfd notifications Allow registration of an eventfd, which will trigger an event every time a completion event happens for this io_uring instance. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-02 14:08:54 -06:00

1 2

59 Commits