/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
Add io_uring IO interface

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up an io_uring instance for doing async IO. On success,
	returns a file descriptor that the application can mmap to
	gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, we will try
	to submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel
	will wait for 'min_complete' events, if they aren't already
	available. It's valid to set IORING_ENTER_GETEVENTS and
	'min_complete' == 0 at the same time; this allows the kernel
	to return already completed events without waiting for them.
	This is useful only for polling, as for IRQ driven IO the
	application can just check the CQ ring without entering the
	kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is handled inline. This avoids the slowness
of the usual thread pools, since cached data is accessed as quickly as
with a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
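To make the flow above concrete, here is a minimal userspace sketch of the two system calls plus the mmap step. It is only an outline (error handling omitted), assumes a libc that exposes __NR_io_uring_setup and __NR_io_uring_enter, and relies on the IORING_OFF_* mmap offsets defined in this header.

/* Hedged sketch: set up an io_uring and map its rings (no error handling). */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int ring_setup(unsigned entries, struct io_uring_params *p)
{
	return (int) syscall(__NR_io_uring_setup, entries, p);
}

static int ring_enter(int fd, unsigned to_submit, unsigned min_complete,
		      unsigned flags)
{
	return (int) syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
			     flags, NULL, 0);
}

int main(void)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	int fd = ring_setup(8, &p);	/* kernel fills in p.sq_off / p.cq_off */

	/* Map the SQ ring, the SQE array and the CQ ring using those offsets. */
	size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(__u32);
	size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

	void *sq = mmap(0, sq_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
			fd, IORING_OFF_SQ_RING);
	struct io_uring_sqe *sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, IORING_OFF_SQES);
	void *cq = mmap(0, cq_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
			fd, IORING_OFF_CQ_RING);

	/*
	 * Fill sqes[], publish the indices through the SQ ring array at
	 * p.sq_off.array, bump the SQ tail, then submit and wait for one
	 * completion with a single system call.
	 */
	ring_enter(fd, 1, 1, IORING_ENTER_GETEVENTS);

	(void)sq; (void)cq; (void)sqes;
	return 0;
}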
/*
 * Header file for the io_uring interface.
 *
 * Copyright (C) 2019 Jens Axboe
 * Copyright (C) 2019 Christoph Hellwig
 */
#ifndef LINUX_IO_URING_H
#define LINUX_IO_URING_H

#include <linux/fs.h>
#include <linux/types.h>
io_uring: add sync cancelation API through io_uring_register()

The io_uring cancelation API is async, like any other API that we expose
there. For the case of finding a request to cancel, or not finding one,
it is fully sync in that when submission returns, the CQE for both the
cancelation request and the targeted request have been posted to the
CQ ring.

However, if the targeted work is being executed by io-wq, the API can
only start the act of canceling it. This makes it difficult to use in
some circumstances, as the caller then has to wait for the CQEs to come
in and match on the same cancelation data there.

Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register()
that always does sync cancelations. For the io-wq case, it'll wait
for the cancelation to come in before returning. The only expected
returns from this API are:

0	Request found and canceled fine.
> 0	Requests found and canceled. Only happens if asked to
	cancel multiple requests, and if the work wasn't in
	progress.
-ENOENT	Request not found.
-ETIME	A timeout on the operation was requested, but the timeout
	expired before we could cancel.

and we won't get -EALREADY via this API.

If the timeout value passed in is -1 (tv_sec and tv_nsec), then that
means that no timeout is requested. Otherwise, the timespec passed in
is the amount of time the sync cancel will wait for a successful
cancelation.

Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
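As an illustration, a hedged sketch of a synchronous cancelation follows. It assumes liburing 2.3 or newer, whose io_uring_register_sync_cancel() helper wraps this command, and the struct io_uring_sync_cancel_reg layout that accompanies it.

/* Hedged sketch: synchronously cancel the request submitted with user_data
 * 0x1234, with no timeout. */
#include <liburing.h>

static int cancel_sync(struct io_uring *ring)
{
	struct io_uring_sync_cancel_reg reg = { };

	reg.addr = 0x1234;		/* cancelation key: the target's user_data */
	reg.timeout.tv_sec = -1;	/* -1/-1 means "no timeout is requested" */
	reg.timeout.tv_nsec = -1;

	/* 0: found and canceled, >0: several canceled, -ENOENT/-ETIME otherwise */
	return io_uring_register_sync_cancel(ring, &reg);
}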
#include <linux/time_types.h>

#ifdef __cplusplus
extern "C" {
#endif

/*
 * IO submission data structure (Submission Queue Entry)
 */
struct io_uring_sqe {
	__u8	opcode;		/* type of operation for this sqe */
	__u8	flags;		/* IOSQE_ flags */
	__u16	ioprio;		/* ioprio for the request */
	__s32	fd;		/* file descriptor to do IO on */
	union {
		__u64	off;	/* offset into file */
		__u64	addr2;
		struct {
			__u32	cmd_op;
			__u32	__pad1;
		};
	};
	union {
		__u64	addr;	/* pointer to buffer or iovecs */
		__u64	splice_off_in;
	};
	__u32	len;		/* buffer size or number of iovecs */
	union {
		__kernel_rwf_t	rw_flags;
		__u32		fsync_flags;
		__u16		poll_events;	/* compatibility */
		__u32		poll32_events;	/* word-reversed for BE */
		__u32		sync_range_flags;
		__u32		msg_flags;
		__u32		timeout_flags;
		__u32		accept_flags;
		__u32		cancel_flags;
		__u32		open_flags;
		__u32		statx_flags;
		__u32		fadvise_advice;
		__u32		splice_flags;
		__u32		rename_flags;
		__u32		unlink_flags;
		__u32		hardlink_flags;
		__u32		xattr_flags;
io_uring: add support for passing fixed file descriptors

With IORING_OP_MSG_RING, one ring can send a message to another ring.
Extend that support to also allow sending a fixed file descriptor to
that ring, enabling one ring to pass a registered descriptor to another
one.

Arguments are extended to pass in:

	sqe->addr3	fixed file slot in source ring
	sqe->file_index	fixed file slot in destination ring

IORING_OP_MSG_RING is extended to take a command argument in sqe->addr.
If set to zero (or IORING_MSG_DATA), it sends just a message like before.
If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according
to the above arguments.

Two common use cases for this are:

1) The server needs to be shut down or restarted, so it passes its file
   descriptors to another one.
2) The backend is split, and one ring accepts connections while others
   get the fd passed and handle the actual connection.

Both of those are classic SCM_RIGHTS use cases, and it's not possible to
support them with direct descriptors today.

By default, this will post a CQE to the target ring, similarly to how
IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message
is posted to the target ring. The issuer is expected to notify the
receiver side separately.

(A sketch of sending a fixed descriptor this way follows the
IORING_MSG_RING definitions further down.)

Signed-off-by: Jens Axboe <axboe@kernel.dk>
		__u32		msg_ring_flags;
	};
	__u64	user_data;	/* data to be passed back at completion time */
	/* pack this to avoid bogus arm OABI complaints */
io_uring: add support for pre-mapped user IO buffers

If we have fixed user buffers, we can map them into the kernel when we
set up the io_uring. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having set up an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point to somewhere inside the indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.

It's perfectly valid to set up a larger buffer and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per-buffer size limit is also imposed.

(A registered-buffer usage sketch follows the struct io_uring_sqe
definition below.)

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
	union {
		/* index into fixed buffers, if used */
		__u16	buf_index;
		/* for grouped buffer selection */
		__u16	buf_group;
	} __attribute__((packed));
	/* personality to use, if used */
	__u16	personality;
	union {
		__s32	splice_fd_in;
		__u32	file_index;
		struct {
			__u16	addr_len;
			__u16	__pad3[1];
		};
	};
	union {
		struct {
			__u64	addr3;
			__u64	__pad2[1];
		};
		/*
		 * If the ring is initialized with IORING_SETUP_SQE128, then
		 * this field is used for 80 bytes of arbitrary command data
		 */
		__u8	cmd[0];
	};
};
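With the SQE layout above in hand, a short liburing-based sketch shows the registered-buffer flow referenced earlier: register one buffer, then issue IORING_OP_READ_FIXED against it. The liburing helpers are an assumption of this example; they merely fill in the io_uring_sqe fields defined above (opcode, fd, off, addr, len, buf_index).

/* Hedged sketch (liburing): one fixed buffer, one READ_FIXED into it. */
#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>

static int read_fixed_example(int file_fd)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov = { .iov_base = malloc(4096), .iov_len = 4096 };
	int ret;

	io_uring_queue_init(8, &ring, 0);
	io_uring_register_buffers(&ring, &iov, 1);	/* IORING_REGISTER_BUFFERS */

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, file_fd, iov.iov_base, 4096, 0, 0);
	sqe->user_data = 0x1234;	/* comes back in cqe->user_data */

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	ret = cqe->res;			/* bytes read, or -errno */
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return ret;
}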

/*
 * If sqe->file_index is set to this for opcodes that instantiate a new
 * direct descriptor (like openat/openat2/accept), then io_uring will allocate
 * an available direct descriptor instead of having the application pass one
 * in. The picked direct descriptor will be returned in cqe->res, or -ENFILE
 * if the space is full.
 */
#define IORING_FILE_INDEX_ALLOC		(~0U)
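A brief, hedged sketch of the allocation behaviour described above. It assumes liburing for the ring plumbing and that io_uring_register_files_sparse() (liburing 2.2+) is available to create the fixed-file table.

/* Hedged sketch: let io_uring pick a free direct descriptor for an accepted
 * socket; the chosen slot index is returned in cqe->res. */
#include <liburing.h>

static void accept_into_allocated_slot(struct io_uring *ring, int listen_fd)
{
	struct io_uring_sqe *sqe;

	/* done once at setup time: room for 64 direct descriptors */
	io_uring_register_files_sparse(ring, 64);

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
	/* ask io_uring to allocate the slot instead of passing one in */
	sqe->file_index = IORING_FILE_INDEX_ALLOC;
}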

enum {
	IOSQE_FIXED_FILE_BIT,
	IOSQE_IO_DRAIN_BIT,
	IOSQE_IO_LINK_BIT,
	IOSQE_IO_HARDLINK_BIT,
	IOSQE_ASYNC_BIT,
io_uring: support buffer selection for OP_READ and OP_RECV

If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.

Now that we have fast (non-async thread) support, a task can have tons
of reads or writes pending. But that means it needs buffers to back that
data, and if the number of connections is high enough, preallocating
them for all possible connections is infeasible.

With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:

	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;

Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.

Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission; a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
	IOSQE_BUFFER_SELECT_BIT,
io_uring: add option to skip CQE posting

Emitting a CQE is expensive from the kernel perspective. Often, it's
also not convenient for the userspace, which spends cycles on processing
it, and it just complicates the logic. A similar problem exists for
linked requests, where we post a CQE for each request in the link.

Introduce a new flag, IOSQE_CQE_SKIP_SUCCESS, to help with this.
When set and a request completes successfully, it won't generate a CQE.
When it fails, it produces a CQE, but all following linked requests will
be CQE-less, regardless of whether they have IOSQE_CQE_SKIP_SUCCESS or not.
The notion of "fail" is the same as for link failing-cancellation, where
it's opcode dependent, and _usually_ result >= 0 is a success, but not
always.

Linked timeouts are a bit special. When the request it's linked to was
not attempted to be executed, e.g. for failed linked requests, it follows
the description above. Otherwise, whether a linked timeout will post a
completion or not depends solely on IOSQE_CQE_SKIP_SUCCESS of that
linked timeout request. Linked timeouts never "fail" during execution, so
for them it's unconditional. Users are expected not to care about the
result of it and to rely solely on the result of the master request.
Another reason for such a treatment is that it's racy: the timeout
callback may be running while the master request posts its completion.

Use case 1:
If one doesn't care about the results of some requests, e.g. normal
timeouts, just set IOSQE_CQE_SKIP_SUCCESS. Error results will still be
posted and need to be handled.

Use case 2:
Set IOSQE_CQE_SKIP_SUCCESS for all requests of a link but the last,
and it'll post a completion only for the last one if everything goes
right; otherwise there will be only one CQE, for the first failed
request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0220fbe06f7cf99e6fc71b4297bb1cb6c0e89c2c.1636559119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
	IOSQE_CQE_SKIP_SUCCESS_BIT,
};

/*
 * sqe->flags
 */
/* use fixed fileset */
#define IOSQE_FIXED_FILE	(1U << IOSQE_FIXED_FILE_BIT)
/* issue after inflight IO */
#define IOSQE_IO_DRAIN		(1U << IOSQE_IO_DRAIN_BIT)
/* links next sqe */
#define IOSQE_IO_LINK		(1U << IOSQE_IO_LINK_BIT)
/* like LINK, but stronger */
#define IOSQE_IO_HARDLINK	(1U << IOSQE_IO_HARDLINK_BIT)
/* always go async */
#define IOSQE_ASYNC		(1U << IOSQE_ASYNC_BIT)
/* select buffer from sqe->buf_group */
#define IOSQE_BUFFER_SELECT	(1U << IOSQE_BUFFER_SELECT_BIT)
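A hedged liburing sketch of the provide/select cycle this flag enables; the buffer group ID, buffer count and sizes are arbitrary example values, and real code should match completions by user_data since they may arrive in any order.

/* Hedged sketch: provide a group of buffers, then let a recv pick one. */
#include <liburing.h>

#define EXAMPLE_BGID	7	/* arbitrary buffer group ID */

static void recv_with_provided_buffers(struct io_uring *ring, int sockfd,
					char bufs[8][1024])
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	/* IORING_OP_PROVIDE_BUFFERS: 8 buffers of 1024 bytes, IDs starting at 0 */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_provide_buffers(sqe, bufs, 1024, 8, EXAMPLE_BGID, 0);
	sqe->user_data = 1;

	/* recv with no buffer of its own; the group is picked at completion time */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv(sqe, sockfd, NULL, 1024, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = EXAMPLE_BGID;
	sqe->user_data = 2;

	io_uring_submit(ring);

	io_uring_wait_cqe(ring, &cqe);
	if (cqe->user_data == 2 && (cqe->flags & IORING_CQE_F_BUFFER)) {
		unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
		/* the received bytes live in bufs[buf_id] */
		(void)buf_id;
	}
	io_uring_cqe_seen(ring, cqe);
}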
/* don't post CQE if request succeeded */
#define IOSQE_CQE_SKIP_SUCCESS	(1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
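A hedged sketch of use case 2 from the commit message above: a write linked to an fsync, where only the final request posts a CQE if everything succeeds. liburing is assumed only for the prep/submit plumbing.

/* Hedged sketch: write-then-fsync link, one CQE on full success. */
#include <liburing.h>

static void write_fsync_one_cqe(struct io_uring *ring, int fd,
				const void *buf, unsigned len)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, fd, buf, len, 0);
	/* linked to the next SQE; a successful completion posts no CQE */
	sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fsync(sqe, fd, 0);
	sqe->user_data = 1;	/* the only CQE we expect to reap on success */

	io_uring_submit(ring);
}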

/*
 * io_uring_setup() flags
 */
#define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
io_uring: add submission polling

This enables an application to do IO without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.

By default, we allow 1 second of active spinning. This can be changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:

	sq_ring->flags |= IORING_SQ_NEED_WAKEUP;

The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard its
io_uring_enter(2) call with:

	read_barrier();
	if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
		io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

instead of calling it unconditionally.

It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
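The guard from the commit message, written out as a hedged C fragment. It assumes sq_flags points at the SQ ring flags word mapped at params.sq_off.flags and that ring_fd is the descriptor returned by io_uring_setup().

/* Hedged sketch of the SQPOLL wakeup guard; new SQEs were already written
 * and the SQ tail published before this is called. */
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void submit_sqpoll(int ring_fd, _Atomic __u32 *sq_flags)
{
	if (atomic_load_explicit(sq_flags, memory_order_acquire) &
	    IORING_SQ_NEED_WAKEUP)
		syscall(__NR_io_uring_enter, ring_fd, 0, 0,
			IORING_ENTER_SQ_WAKEUP, NULL, 0);
	/* otherwise the SQ poll thread is awake and no system call is needed */
}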
#define IORING_SETUP_SQPOLL	(1U << 1)	/* SQ poll thread */
#define IORING_SETUP_SQ_AFF	(1U << 2)	/* sq_thread_cpu is valid */
#define IORING_SETUP_CQSIZE	(1U << 3)	/* app defines CQ size */
#define IORING_SETUP_CLAMP	(1U << 4)	/* clamp SQ/CQ ring sizes */
#define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
#define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
#define IORING_SETUP_SUBMIT_ALL	(1U << 7)	/* continue submit on error */
/*
 * Cooperative task running. When requests complete, they often require
 * forcing the submitter to transition to the kernel to complete. If this
 * flag is set, work will be done when the task transitions anyway, rather
 * than force an inter-processor interrupt reschedule. This avoids interrupting
 * a task running in userspace, and saves an IPI.
 */
#define IORING_SETUP_COOP_TASKRUN	(1U << 8)
/*
 * If COOP_TASKRUN is set, get notified if task work is available for
 * running and a kernel transition would be needed to run it. This sets
 * IORING_SQ_TASKRUN in the sq ring flags. Not valid with COOP_TASKRUN.
 */
#define IORING_SETUP_TASKRUN_FLAG	(1U << 9)
#define IORING_SETUP_SQE128		(1U << 10) /* SQEs are 128 byte */
#define IORING_SETUP_CQE32		(1U << 11) /* CQEs are 32 byte */
/*
 * Only one task is allowed to submit requests
 */
#define IORING_SETUP_SINGLE_ISSUER	(1U << 12)

enum io_uring_op {
	IORING_OP_NOP,
	IORING_OP_READV,
	IORING_OP_WRITEV,
	IORING_OP_FSYNC,
	IORING_OP_READ_FIXED,
	IORING_OP_WRITE_FIXED,
	IORING_OP_POLL_ADD,
	IORING_OP_POLL_REMOVE,
	IORING_OP_SYNC_FILE_RANGE,
	IORING_OP_SENDMSG,
	IORING_OP_RECVMSG,
	IORING_OP_TIMEOUT,
	IORING_OP_TIMEOUT_REMOVE,
	IORING_OP_ACCEPT,
	IORING_OP_ASYNC_CANCEL,
	IORING_OP_LINK_TIMEOUT,
	IORING_OP_CONNECT,
	IORING_OP_FALLOCATE,
	IORING_OP_OPENAT,
	IORING_OP_CLOSE,
	IORING_OP_FILES_UPDATE,
	IORING_OP_STATX,
	IORING_OP_READ,
	IORING_OP_WRITE,
	IORING_OP_FADVISE,
	IORING_OP_MADVISE,
	IORING_OP_SEND,
	IORING_OP_RECV,
	IORING_OP_OPENAT2,
	IORING_OP_EPOLL_CTL,
	IORING_OP_SPLICE,
	IORING_OP_PROVIDE_BUFFERS,
	IORING_OP_REMOVE_BUFFERS,
	IORING_OP_TEE,
	IORING_OP_SHUTDOWN,
	IORING_OP_RENAMEAT,
	IORING_OP_UNLINKAT,
	IORING_OP_MKDIRAT,
	IORING_OP_SYMLINKAT,
	IORING_OP_LINKAT,
	IORING_OP_MSG_RING,
	IORING_OP_FSETXATTR,
	IORING_OP_SETXATTR,
	IORING_OP_FGETXATTR,
	IORING_OP_GETXATTR,
	IORING_OP_SOCKET,
	IORING_OP_URING_CMD,
	IORING_OP_SEND_ZC,

	/* this goes last, obviously */
	IORING_OP_LAST,
};

/*
 * sqe->fsync_flags
 */
#define IORING_FSYNC_DATASYNC	(1U << 0)
/*
 * sqe->timeout_flags
 */
#define IORING_TIMEOUT_ABS		(1U << 0)
#define IORING_TIMEOUT_UPDATE		(1U << 1)
#define IORING_TIMEOUT_BOOTTIME		(1U << 2)
#define IORING_TIMEOUT_REALTIME		(1U << 3)
#define IORING_LINK_TIMEOUT_UPDATE	(1U << 4)
#define IORING_TIMEOUT_ETIME_SUCCESS	(1U << 5)
#define IORING_TIMEOUT_CLOCK_MASK	(IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME)
#define IORING_TIMEOUT_UPDATE_MASK	(IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE)

/*
 * sqe->splice_flags
 * extends splice(2) flags
 */
#define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */

io_uring: allow events and user_data update of running poll requests

This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and
IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are
masked into sqe->len. If set, the POLL_ADD will have the following
behavior:

- sqe->addr must contain the user_data of the poll request that
  needs to be modified. This field is otherwise invalid for a POLL_ADD
  command.
- If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the
  new mask for the existing poll request. There are no checks for whether
  these are identical or not; if a matching poll request is found, it
  is re-armed with the new mask.
- If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new
  user_data for the existing poll request.

A POLL_ADD with any of these flags set may complete with any of the
following results:

1) 0, which means that we successfully found the existing poll request
   specified, and performed the re-arm procedure. Any error from that
   re-arm will be exposed as a completion event for that original poll
   request, not for the update request.
2) -ENOENT, if no existing poll request was found with the given
   user_data.
3) -EALREADY, if the existing poll request was already in the process of
   being removed/canceled/completing.
4) -EACCES, if an attempt was made to modify an internal poll request
   (e.g. not one originally issued as IORING_OP_POLL_ADD).

The usual -EINVAL cases apply as well, if any invalid fields are set
in the sqe for this command type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

/*
 * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
 * command flags for POLL_ADD are stored in sqe->len.
 *
 * IORING_POLL_ADD_MULTI	Multishot poll. Sets IORING_CQE_F_MORE if
 *				the poll handler will continue to report
 *				CQEs on behalf of the same SQE.
 *
 * IORING_POLL_UPDATE		Update existing poll request, matching
 *				sqe->addr as the old user_data field.
 *
 * IORING_POLL_LEVEL		Level triggered poll.
 */
#define IORING_POLL_ADD_MULTI		(1U << 0)
#define IORING_POLL_UPDATE_EVENTS	(1U << 1)
#define IORING_POLL_UPDATE_USER_DATA	(1U << 2)
#define IORING_POLL_ADD_LEVEL		(1U << 3)
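A hedged sketch combining both ideas: arm a multishot poll, then retarget its events and user_data using the sqe field mapping documented in the commit message above (sqe->addr = old user_data, sqe->off = new user_data, update flags in sqe->len). liburing is assumed only for the prep/submit plumbing, and the raw poll32_events store assumes the little-endian layout noted in the SQE definition.

/* Hedged sketch: multishot poll followed by an events/user_data update. */
#include <liburing.h>
#include <poll.h>

static void poll_then_update(struct io_uring *ring, int sockfd)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_poll_multishot(sqe, sockfd, POLLIN);	/* sets IORING_POLL_ADD_MULTI */
	sqe->user_data = 0xcafe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_poll_add(sqe, -1, 0);	/* update variant ignores the fd */
	sqe->len = IORING_POLL_UPDATE_EVENTS | IORING_POLL_UPDATE_USER_DATA;
	sqe->addr = 0xcafe;			/* which poll request to modify */
	sqe->off = 0xbeef;			/* its new user_data */
	sqe->poll32_events = POLLIN | POLLOUT;	/* its new event mask (LE layout) */

	io_uring_submit(ring);
}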

/*
 * ASYNC_CANCEL flags.
 *
 * IORING_ASYNC_CANCEL_ALL	Cancel all requests that match the given key
 * IORING_ASYNC_CANCEL_FD	Key off 'fd' for cancelation rather than the
 *				request 'user_data'
 * IORING_ASYNC_CANCEL_ANY	Match any request
 * IORING_ASYNC_CANCEL_FD_FIXED	'fd' passed in is a fixed descriptor
 */
#define IORING_ASYNC_CANCEL_ALL		(1U << 0)
#define IORING_ASYNC_CANCEL_FD		(1U << 1)
#define IORING_ASYNC_CANCEL_ANY		(1U << 2)
#define IORING_ASYNC_CANCEL_FD_FIXED	(1U << 3)
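A hedged sketch of a cancel-by-fd request built directly from the fields documented above; how the SQE slot is obtained and submitted is left to the caller, and the user_data value is only an example.

/* Hedged sketch: cancel every pending request that targets a given fd. */
#include <linux/io_uring.h>
#include <string.h>

static void prep_cancel_all_on_fd(struct io_uring_sqe *sqe, int fd)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_ASYNC_CANCEL;
	sqe->fd = fd;			/* the cancelation key is this fd */
	sqe->cancel_flags = IORING_ASYNC_CANCEL_FD | IORING_ASYNC_CANCEL_ALL;
	sqe->user_data = 0xdead;	/* cqe->res reports how many were canceled */
}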

/*
 * send/sendmsg and recv/recvmsg flags (sqe->ioprio)
 *
 * IORING_RECVSEND_POLL_FIRST	If set, instead of first attempting to send
 *				or receive and arm poll if that yields an
 *				-EAGAIN result, arm poll upfront and skip
 *				the initial transfer attempt.
 *
 * IORING_RECV_MULTISHOT	Multishot recv. Sets IORING_CQE_F_MORE if
 *				the handler will continue to report
 *				CQEs on behalf of the same SQE.
 *
 * IORING_RECVSEND_FIXED_BUF	Use registered buffers, the index is stored in
 *				the buf_index field.
 */
#define IORING_RECVSEND_POLL_FIRST	(1U << 0)
#define IORING_RECV_MULTISHOT		(1U << 1)
#define IORING_RECVSEND_FIXED_BUF	(1U << 2)
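A hedged liburing sketch of a multishot receive; it assumes a buffer group has already been provided for this ring, since multishot recv relies on buffer selection, and it sets the per-op flag in sqe->ioprio as documented above.

/* Hedged sketch: a recv that keeps posting CQEs (with IORING_CQE_F_MORE set)
 * until it is canceled or the buffer group runs dry. */
#include <liburing.h>

static void prep_recv_multishot_example(struct io_uring *ring, int sockfd, int bgid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
	sqe->ioprio |= IORING_RECV_MULTISHOT;	/* re-arm after each completion */
	sqe->flags |= IOSQE_BUFFER_SELECT;	/* multishot recv needs provided buffers */
	sqe->buf_group = bgid;
}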

/*
 * accept flags stored in sqe->ioprio
 */
#define IORING_ACCEPT_MULTISHOT	(1U << 0)

/*
 * IORING_OP_MSG_RING command types, stored in sqe->addr
 */
enum {
	IORING_MSG_DATA,	/* pass sqe->len as 'res' and off as user_data */
	IORING_MSG_SEND_FD,	/* send a registered fd to another ring */
};

/*
 * IORING_OP_MSG_RING flags (sqe->msg_ring_flags)
 *
 * IORING_MSG_RING_CQE_SKIP	Don't post a CQE to the target ring. Not
 *				applicable for IORING_MSG_DATA, obviously.
 */
#define IORING_MSG_RING_CQE_SKIP	(1U << 0)
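A hedged sketch of the fixed-descriptor passing described in the "passing fixed file descriptors" commit earlier, built from raw SQE fields using that commit's mapping (sqe->addr selects the command, sqe->addr3 the source slot, sqe->file_index the destination slot). The slot numbers and user_data are example values only.

/* Hedged sketch: send the registered descriptor in 'src_slot' of this ring to
 * the ring behind 'target_ring_fd'. */
#include <linux/io_uring.h>
#include <string.h>

static void prep_send_fixed_fd(struct io_uring_sqe *sqe, int target_ring_fd,
			       unsigned src_slot, unsigned dst_slot)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_MSG_RING;
	sqe->fd = target_ring_fd;	/* the ring receiving the descriptor */
	sqe->addr = IORING_MSG_SEND_FD;	/* command, instead of IORING_MSG_DATA */
	sqe->addr3 = src_slot;		/* fixed file slot in the source ring */
	sqe->file_index = dst_slot;	/* fixed file slot in the destination ring */
	sqe->off = 0xfeed;		/* delivered as the target CQE's user_data,
					 * as with IORING_MSG_DATA */
	/* set sqe->msg_ring_flags = IORING_MSG_RING_CQE_SKIP to suppress that CQE */
}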
/*
 * IO completion data structure (Completion Queue Entry)
 */
struct io_uring_cqe {
	__u64	user_data;	/* sqe->data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;

	/*
	 * If the ring is initialized with IORING_SETUP_CQE32, then this field
	 * contains 16-bytes of padding, doubling the size of the CQE.
	 */
	__u64 big_cqe[];
};
/*
 * cqe->flags
 *
 * IORING_CQE_F_BUFFER          If set, the upper 16 bits are the buffer ID
 * IORING_CQE_F_MORE            If set, parent SQE will generate more CQE entries
 * IORING_CQE_F_SOCK_NONEMPTY   If set, more data to read after socket recv
 * IORING_CQE_F_NOTIF           Set for notification CQEs. Can be used to
 *                              distinguish them from sends.
 */
#define IORING_CQE_F_BUFFER             (1U << 0)
#define IORING_CQE_F_MORE               (1U << 1)
#define IORING_CQE_F_SOCK_NONEMPTY      (1U << 2)
#define IORING_CQE_F_NOTIF              (1U << 3)

enum {
        IORING_CQE_BUFFER_SHIFT         = 16,
};
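
To make the encoding above concrete, here is a minimal, hedged sketch of reaping one CQE and decoding a selected buffer ID. It is not the liburing implementation; it assumes <stdio.h> plus this header are included, and that cq_head, cq_tail, cq_mask and cqes were resolved from the mmap'ed CQ ring described further down.

static int reap_one_cqe(unsigned *cq_head, const unsigned *cq_tail,
                        unsigned cq_mask, const struct io_uring_cqe *cqes)
{
        unsigned head = *cq_head;
        const struct io_uring_cqe *cqe;

        /* tail is written by the kernel; pair the read with an acquire */
        if (head == __atomic_load_n(cq_tail, __ATOMIC_ACQUIRE))
                return -1;                      /* nothing completed yet */

        cqe = &cqes[head & cq_mask];
        if (cqe->res < 0)
                fprintf(stderr, "request %llu failed: %d\n",
                        (unsigned long long)cqe->user_data, cqe->res);
        else if (cqe->flags & IORING_CQE_F_BUFFER)
                printf("request %llu used buffer id %u\n",
                       (unsigned long long)cqe->user_data,
                       cqe->flags >> IORING_CQE_BUFFER_SHIFT);

        /* publish the new head so the kernel can reuse the slot */
        __atomic_store_n(cq_head, head + 1, __ATOMIC_RELEASE);
        return 0;
}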

/*
 * Magic offsets for the application to mmap the data it needs
 */
#define IORING_OFF_SQ_RING              0ULL
#define IORING_OFF_CQ_RING              0x8000000ULL
#define IORING_OFF_SQES                 0x10000000ULL

/*
 * Filled with the offset for mmap(2)
 */
struct io_sqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 flags;
        __u32 dropped;
        __u32 array;
        __u32 resv1;
        __u64 resv2;
};
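
As a rough usage sketch (not the authoritative setup code), the helper below maps the SQ ring with IORING_OFF_SQ_RING and resolves its fields through sq_off. It assumes <sys/mman.h>, <errno.h> and this header are included, that fd came from io_uring_setup(2), and that p is the struct io_uring_params (defined further down) the kernel filled in; the app_sq_ring struct is purely illustrative.

struct app_sq_ring {
        unsigned *head, *tail, *ring_mask, *ring_entries, *flags, *array;
};

static int map_sq_ring(int fd, const struct io_uring_params *p,
                       struct app_sq_ring *sq)
{
        size_t sz = p->sq_off.array + p->sq_entries * sizeof(__u32);
        char *ring = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);

        if (ring == MAP_FAILED)
                return -errno;

        sq->head         = (unsigned *)(ring + p->sq_off.head);
        sq->tail         = (unsigned *)(ring + p->sq_off.tail);
        sq->ring_mask    = (unsigned *)(ring + p->sq_off.ring_mask);
        sq->ring_entries = (unsigned *)(ring + p->sq_off.ring_entries);
        sq->flags        = (unsigned *)(ring + p->sq_off.flags);
        sq->array        = (unsigned *)(ring + p->sq_off.array);
        return 0;
}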

io_uring: add submission polling
This enables an application to do IO without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can be changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard its
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
        io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

/*
 * sq_ring->flags
 */
#define IORING_SQ_NEED_WAKEUP   (1U << 0) /* needs io_uring_enter wakeup */

io_uring: export cq overflow status to userspace
Applications which are not willing to use io_uring_enter() to reap and
handle cqes may rely completely on liburing's io_uring_peek_cqe(). But if
the cq ring has overflowed, io_uring_peek_cqe() is currently not aware of
that overflow and won't enter the kernel to flush cqes. The test program
below can reveal this bug:

static void test_cq_overflow(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        struct io_uring_sqe *sqe;
        int issued = 0;
        int ret = 0;

        do {
                sqe = io_uring_get_sqe(ring);
                if (!sqe) {
                        fprintf(stderr, "get sqe failed\n");
                        break;
                }
                ret = io_uring_submit(ring);
                if (ret <= 0) {
                        if (ret != -EBUSY)
                                fprintf(stderr, "sqe submit failed: %d\n", ret);
                        break;
                }
                issued++;
        } while (ret > 0);
        assert(ret == -EBUSY);
        printf("issued requests: %d\n", issued);

        while (issued) {
                ret = io_uring_peek_cqe(ring, &cqe);
                if (ret) {
                        if (ret != -EAGAIN) {
                                fprintf(stderr, "peek completion failed: %s\n",
                                        strerror(ret));
                                break;
                        }
                        printf("left requests: %d\n", issued);
                        continue;
                }
                io_uring_cqe_seen(ring, cqe);
                issued--;
                printf("left requests: %d\n", issued);
        }
}

int main(int argc, char *argv[])
{
        int ret;
        struct io_uring ring;

        ret = io_uring_queue_init(16, &ring, 0);
        if (ret) {
                fprintf(stderr, "ring setup failed: %d\n", ret);
                return 1;
        }
        test_cq_overflow(&ring);
        return 0;
}

To fix this issue, export the cq overflow status to userspace by adding a
new IORING_SQ_CQ_OVERFLOW flag; helper functions in liburing, such as
io_uring_peek_cqe(), can then notice the overflow and flush accordingly.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
#define IORING_SQ_CQ_OVERFLOW   (1U << 1) /* CQ ring is overflown */
#define IORING_SQ_TASKRUN       (1U << 2) /* task should enter the kernel */
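
A slightly more concrete version of the wakeup guard from the submission-polling description above, as a hedged sketch only: sq_flags is assumed to point at the flags word of the mapped SQ ring, and __NR_io_uring_enter is assumed to be available from <sys/syscall.h> via <unistd.h>'s syscall(2) (IORING_ENTER_SQ_WAKEUP is defined later in this header).

static void sqpoll_submit(int ring_fd, unsigned *sq_flags, unsigned to_submit)
{
        unsigned enter_flags = 0;

        /* only enter the kernel if the SQPOLL thread has gone idle */
        if (__atomic_load_n(sq_flags, __ATOMIC_ACQUIRE) & IORING_SQ_NEED_WAKEUP)
                enter_flags |= IORING_ENTER_SQ_WAKEUP;

        if (enter_flags)
                syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                        enter_flags, NULL, 0);
}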

struct io_cqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 overflow;
        __u32 cqes;
        __u32 flags;
        __u32 resv1;
        __u64 resv2;
};

/*
 * cq_ring->flags
 */

/* disable eventfd notifications */
#define IORING_CQ_EVENTFD_DISABLED      (1U << 0)

/*
 * io_uring_enter(2) flags
 */

io_uring: add support for registering ring file descriptors
Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.
Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
registering the ring fds themselves:
1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and unregisters them.
When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.
In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (i.e.
the number of requests submitted per io_uring_enter(2) invocation) is
low. In a microbenchmark, submitting NOP requests, we see the following
increases in performance:

Requests per syscall    Baseline        Registered      Increase
----------------------------------------------------------------
1                        ~7030K          ~8080K         +15%
2                       ~13120K         ~14800K         +13%
4                       ~22740K         ~25300K         +11%

Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
#define IORING_ENTER_GETEVENTS          (1U << 0)
#define IORING_ENTER_SQ_WAKEUP          (1U << 1)
#define IORING_ENTER_SQ_WAIT            (1U << 2)
#define IORING_ENTER_EXT_ARG            (1U << 3)
#define IORING_ENTER_REGISTERED_RING    (1U << 4)
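
There is no glibc wrapper for io_uring_enter(2), so a hedged sketch using raw syscall(2) follows. It assumes <sys/syscall.h>, <unistd.h> and <errno.h> are available and leaves the last two arguments unused (no signal mask, no extended argument).

static int submit_and_wait_one(int ring_fd, unsigned to_submit)
{
        /* submit 'to_submit' SQEs and wait for at least one completion */
        int ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, 1,
                          IORING_ENTER_GETEVENTS, NULL, 0);

        return ret < 0 ? -errno : ret;  /* >= 0: number of SQEs consumed */
}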

/*
 * Passed in for io_uring_setup(2). Copied back with updated info on success
 */
struct io_uring_params {
        __u32 sq_entries;
        __u32 cq_entries;
        __u32 flags;
        __u32 sq_thread_cpu;
        __u32 sq_thread_idle;
        __u32 features;
        __u32 wq_fd;
        __u32 resv[3];
        struct io_sqring_offsets sq_off;
        struct io_cqring_offsets cq_off;
};
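
A hedged sketch of creating a ring with this structure: on success the kernel copies the actual ring sizes and the mmap offsets back into p. It assumes <string.h>, <sys/syscall.h>, <unistd.h> and <errno.h>; liburing wraps the same call as io_uring_queue_init().

static int setup_ring(unsigned entries, struct io_uring_params *p)
{
        int fd;

        memset(p, 0, sizeof(*p));       /* flags, sq_thread_* etc. left at 0 */
        fd = syscall(__NR_io_uring_setup, entries, p);
        return fd < 0 ? -errno : fd;    /* fd is then mmap'ed as shown above */
}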

/*
 * io_uring_params->features flags
 */
#define IORING_FEAT_SINGLE_MMAP         (1U << 0)

io_uring: add support for backlogged CQ ring
Currently we drop completion events if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.
After this patch, we never overflow the ring; we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.
Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them; they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
#define IORING_FEAT_NODROP              (1U << 1)
#define IORING_FEAT_SUBMIT_STABLE       (1U << 2)
#define IORING_FEAT_RW_CUR_POS          (1U << 3)
#define IORING_FEAT_CUR_PERSONALITY     (1U << 4)

io_uring: use poll driven retry for files that support it
Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.
This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just
sit and block, waiting for the IO to complete.
The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll linked with another op is fast as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
#define IORING_FEAT_FAST_POLL           (1U << 5)
#define IORING_FEAT_POLL_32BITS         (1U << 6)
#define IORING_FEAT_SQPOLL_NONFIXED     (1U << 7)
#define IORING_FEAT_EXT_ARG             (1U << 8)
#define IORING_FEAT_NATIVE_WORKERS      (1U << 9)
#define IORING_FEAT_RSRC_TAGS           (1U << 10)

io_uring: add option to skip CQE posting
Emitting a CQE is expensive from the kernel perspective. Often, it's
also not convenient for the userspace, which spends some cycles on
processing and just complicates the logic. A similar problem exists for
linked requests, where we post a CQE for each request in the link.
Introduce a new flag, IOSQE_CQE_SKIP_SUCCESS, to help with this.
When it is set and a request completes successfully, it won't generate a
CQE. When it fails, it produces a CQE, but all following linked requests
will be CQE-less, regardless of whether they have IOSQE_CQE_SKIP_SUCCESS
or not. The notion of "fail" is the same as for link failing-cancellation,
where it's opcode dependent, and _usually_ result >= 0 is a success, but
not always.
Linked timeouts are a bit special. When the request it's linked to was
not attempted to be executed, e.g. failing linked requests, it follows
the description above. Otherwise, whether a linked timeout will post a
completion or not solely depends on IOSQE_CQE_SKIP_SUCCESS of that
linked timeout request. Linked timeouts never "fail" during execution, so
for them it's unconditional. It's expected for users to not really care
about the result of it but rely solely on the result of the master
request. Another reason for such a treatment is that it's racy, and the
timeout callback may be running while the master request posts its
completion.
use case 1:
If one doesn't care about results of some requests, e.g. normal
timeouts, just set IOSQE_CQE_SKIP_SUCCESS. An error result will still be
posted and needs to be handled.
use case 2:
Set IOSQE_CQE_SKIP_SUCCESS for all requests of a link but the last,
and it'll post a completion only for the last one if everything goes
right; otherwise there will be only one CQE, for the first failed
request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0220fbe06f7cf99e6fc71b4297bb1cb6c0e89c2c.1636559119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
#define IORING_FEAT_CQE_SKIP            (1U << 11)
#define IORING_FEAT_LINKED_FILE         (1U << 12)
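
After io_uring_setup(2) returns, p->features tells the application which of the above the running kernel provides. A small, purely illustrative check (assuming <stdio.h> and this header are included):

static void report_features(const struct io_uring_params *p)
{
        if (p->features & IORING_FEAT_NODROP)
                printf("CQEs are backlogged instead of dropped\n");
        if (!(p->features & IORING_FEAT_FAST_POLL))
                printf("no poll driven retry; expect io-wq offload for sockets\n");
        if (p->features & IORING_FEAT_EXT_ARG)
                printf("io_uring_enter(2) accepts struct io_uring_getevents_arg\n");
}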

io_uring: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we
set up the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having set up an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to set up a larger buffer and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per-buffer size limit is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

/*
 * io_uring_register(2) opcodes and arguments
 */
enum {
        IORING_REGISTER_BUFFERS                 = 0,
        IORING_UNREGISTER_BUFFERS               = 1,
        IORING_REGISTER_FILES                   = 2,
        IORING_UNREGISTER_FILES                 = 3,
        IORING_REGISTER_EVENTFD                 = 4,
        IORING_UNREGISTER_EVENTFD               = 5,
        IORING_REGISTER_FILES_UPDATE            = 6,
        IORING_REGISTER_EVENTFD_ASYNC           = 7,
        IORING_REGISTER_PROBE                   = 8,
        IORING_REGISTER_PERSONALITY             = 9,
        IORING_UNREGISTER_PERSONALITY           = 10,
        IORING_REGISTER_RESTRICTIONS            = 11,
        IORING_REGISTER_ENABLE_RINGS            = 12,

io_uring: change registration/upd/rsrc tagging ABI
There are ABI aspects of the recently added rsrc registration/update and
tagging that might become a nuisance in the future. First,
IORING_REGISTER_RSRC[_UPD] hides different types of resources under it,
so it breaks fine-grained control over them by restrictions. It works for
now, but once those are wanted under restrictions it would require a
rework.
It was also inconvenient trying to fit a new resource not supporting
all the features (e.g. dynamic update) into the interface, so it is
better to return to IORING_REGISTER_* top level dispatching.
Second, register/update were considered to accept a type of resource.
However, that's not a good idea, because there might be several ways of
registering a single resource type, e.g. we may want to add non-contig
buffers or anything more exquisite such as dma mapped memory.
So, remove IORING_RSRC_[FILE,BUFFER] from the ABI, and place them
internally for now to limit changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

        /* extended with tagging */
        IORING_REGISTER_FILES2                  = 13,
        IORING_REGISTER_FILES_UPDATE2           = 14,
        IORING_REGISTER_BUFFERS2                = 15,
        IORING_REGISTER_BUFFERS_UPDATE          = 16,

        /* set/clear io-wq thread affinities */
        IORING_REGISTER_IOWQ_AFF                = 17,
        IORING_UNREGISTER_IOWQ_AFF              = 18,

        /* set/get max number of io-wq workers */
        IORING_REGISTER_IOWQ_MAX_WORKERS        = 19,

        /* register/unregister io_uring fd with the ring */
        IORING_REGISTER_RING_FDS                = 20,
        IORING_UNREGISTER_RING_FDS              = 21,

io_uring: add support for ring mapped supplied buffers
Provided buffers allow an application to supply io_uring with buffers
that can then be grabbed for a read/receive request, when the data
source is ready to deliver data. The existing scheme relies on using
IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
in real world applications. It's pretty efficient if the application
is able to supply back batches of provided buffers when they have been
consumed and the application is ready to recycle them, but if
fragmentation occurs in the buffer space, it can become difficult to
supply enough buffers in time. This hurts efficiency.
Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
to set up a shared queue for each buffer group of provided buffers. The
application can then supply buffers simply by adding them to this ring,
and the kernel can consume them just as easily. The ring shares the head
with the application; the tail remains private in the kernel.
Provided buffers set up with IORING_REGISTER_PBUF_RING cannot use
IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
ring; they must use the mapped ring. Mapped provided buffer rings can
co-exist with normal provided buffers, just not within the same group ID.
To gauge the overhead of the existing scheme and evaluate the mapped ring
approach, a simple NOP benchmark was written. It uses a ring of 128
entries, and submits/completes 32 at a time. 'Replenish' is how
many buffers are provided back at a time after they have been
consumed:

Test                    Replenish       NOPs/sec
================================================================
No provided buffers     NA              ~30M
Provided buffers        32              ~16M
Provided buffers        1               ~10M
Ring buffers            32              ~27M
Ring buffers            1               ~27M

The ring mapped buffers perform almost as well as not using provided
buffers at all, and they don't care if you provide 1 or more back at
the same time. This means applications can just replenish as they go,
rather than need to batch and compact, further reducing overhead in the
application. The NOP benchmark above doesn't need to do any compaction,
so that overhead isn't even reflected in the above test.
Co-developed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

        /* register ring based provide buffer group */
        IORING_REGISTER_PBUF_RING               = 22,
        IORING_UNREGISTER_PBUF_RING             = 23,

        /* sync cancelation API */
        IORING_REGISTER_SYNC_CANCEL             = 24,

        /* register a range of fixed file slots for automatic slot allocation */
        IORING_REGISTER_FILE_ALLOC_RANGE        = 25,

        /* this goes last */
        IORING_REGISTER_LAST
};
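
io_uring_register(2) also has no libc wrapper. The sketch below registers a set of fixed buffers, the oldest of these opcodes; it is illustrative only and assumes <sys/syscall.h>, <sys/uio.h>, <unistd.h> and <errno.h> are included.

static int register_buffers(int ring_fd, const struct iovec *iovs,
                            unsigned nr_iovs)
{
        /* opcode, arg and nr_args follow IORING_REGISTER_BUFFERS semantics */
        int ret = syscall(__NR_io_uring_register, ring_fd,
                          IORING_REGISTER_BUFFERS, iovs, nr_iovs);

        return ret < 0 ? -errno : 0;
}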

/* io-wq worker categories */
enum {
        IO_WQ_BOUND,
        IO_WQ_UNBOUND,
};

/* deprecated, see struct io_uring_rsrc_update */
struct io_uring_files_update {
        __u32 offset;
        __u32 resv;
        __aligned_u64 /* __s32 * */ fds;
};

/*
 * Register a fully sparse file space, rather than pass in an array of all
 * -1 file descriptors.
 */
#define IORING_RSRC_REGISTER_SPARSE     (1U << 0)

struct io_uring_rsrc_register {
        __u32 nr;
        __u32 flags;
        __u64 resv2;
        __aligned_u64 data;
        __aligned_u64 tags;
};

struct io_uring_rsrc_update {
        __u32 offset;
        __u32 resv;
        __aligned_u64 data;
};

struct io_uring_rsrc_update2 {
        __u32 offset;
        __u32 resv;
        __aligned_u64 data;
        __aligned_u64 tags;
        __u32 nr;
        __u32 resv2;
};

struct io_uring_notification_slot {
        __u64 tag;
        __u64 resv[3];
};

struct io_uring_notification_register {
        __u32 nr_slots;
        __u32 resv;
        __u64 resv2;
        __u64 data;
        __u64 resv3;
};

/* Skip updating fd indexes set to this value in the fd table */
#define IORING_REGISTER_FILES_SKIP      (-2)

#define IO_URING_OP_SUPPORTED   (1U << 0)

struct io_uring_probe_op {
        __u8 op;
        __u8 resv;
        __u16 flags;    /* IO_URING_OP_* flags */
        __u32 resv2;
};

struct io_uring_probe {
        __u8 last_op;   /* last opcode supported */
        __u8 ops_len;   /* length of ops[] array below */
        __u16 resv;
        __u32 resv2[3];
        struct io_uring_probe_op ops[];
};
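
A hedged sketch of using IORING_REGISTER_PROBE with these structures to ask whether a given opcode is supported; 256 slots is an illustrative upper bound, comfortably larger than IORING_OP_LAST today. It assumes <stdlib.h>, <sys/syscall.h>, <unistd.h> and <errno.h>.

static int opcode_supported(int ring_fd, unsigned op)
{
        size_t len = sizeof(struct io_uring_probe) +
                     256 * sizeof(struct io_uring_probe_op);
        struct io_uring_probe *probe = calloc(1, len);
        int ret, supported = 0;

        if (!probe)
                return -ENOMEM;
        ret = syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_PROBE,
                      probe, 256);
        if (!ret && op <= probe->last_op)
                supported = !!(probe->ops[op].flags & IO_URING_OP_SUPPORTED);
        free(probe);
        return ret < 0 ? -errno : supported;
}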

struct io_uring_restriction {
        __u16 opcode;
        union {
                __u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
                __u8 sqe_op;      /* IORING_RESTRICTION_SQE_OP */
                __u8 sqe_flags;   /* IORING_RESTRICTION_SQE_FLAGS_* */
        };
        __u8 resv;
        __u32 resv2[3];
};

struct io_uring_buf {
        __u64   addr;
        __u32   len;
        __u16   bid;
        __u16   resv;
};

struct io_uring_buf_ring {
        union {
                /*
                 * To avoid spilling into more pages than we need to, the
                 * ring tail is overlaid with the io_uring_buf->resv field.
                 */
                struct {
                        __u64   resv1;
                        __u32   resv2;
                        __u16   resv3;
                        __u16   tail;
                };
                struct io_uring_buf bufs[0];
        };
};

/* argument for IORING_(UN)REGISTER_PBUF_RING */
struct io_uring_buf_reg {
        __u64   ring_addr;
        __u32   ring_entries;
        __u16   bgid;
        __u16   pad;
        __u64   resv[3];
};
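
To illustrate how these pieces fit together, the sketch below registers an anonymous mapping as a provided-buffer ring for group bgid and publishes a single buffer in slot 0. This is a rough sketch, not the canonical liburing code: error handling and unmapping are elided, entries must be a power of two, and the tail shares storage with bufs[0].resv as described above. It assumes <sys/mman.h>, <sys/syscall.h> and <unistd.h>.

static struct io_uring_buf_ring *setup_buf_ring(int ring_fd, __u16 bgid,
                                                __u32 entries,
                                                void *buf, __u32 buf_len)
{
        struct io_uring_buf_reg reg = { .ring_entries = entries, .bgid = bgid };
        struct io_uring_buf_ring *br;

        br = mmap(NULL, entries * sizeof(struct io_uring_buf),
                  PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (br == MAP_FAILED)
                return NULL;

        reg.ring_addr = (unsigned long)br;
        if (syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_PBUF_RING,
                    &reg, 1) < 0)
                return NULL;

        /* describe one buffer in slot 0, then publish it by advancing tail */
        br->bufs[0].addr = (unsigned long)buf;
        br->bufs[0].len  = buf_len;
        br->bufs[0].bid  = 0;
        __atomic_store_n(&br->tail, 1, __ATOMIC_RELEASE);
        return br;
}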

/*
 * io_uring_restriction->opcode values
 */
enum {
        /* Allow an io_uring_register(2) opcode */
        IORING_RESTRICTION_REGISTER_OP          = 0,

        /* Allow an sqe opcode */
        IORING_RESTRICTION_SQE_OP               = 1,

        /* Allow sqe flags */
        IORING_RESTRICTION_SQE_FLAGS_ALLOWED    = 2,

        /* Require sqe flags (these flags must be set on each submission) */
        IORING_RESTRICTION_SQE_FLAGS_REQUIRED   = 3,

        IORING_RESTRICTION_LAST
};

struct io_uring_getevents_arg {
        __u64   sigmask;
        __u32   sigmask_sz;
        __u32   pad;
        __u64   ts;
};
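
A hedged sketch of the extended-argument form of io_uring_enter(2): with IORING_ENTER_EXT_ARG set, the final two arguments carry a pointer to this structure and its size rather than a plain sigset_t, which lets a wait carry a timeout. It requires a kernel advertising IORING_FEAT_EXT_ARG and assumes <sys/syscall.h>, <unistd.h> and <errno.h>.

static int wait_cqe_timeout(int ring_fd, long long nsec)
{
        struct __kernel_timespec ts = {
                .tv_sec  = nsec / 1000000000LL,
                .tv_nsec = nsec % 1000000000LL,
        };
        struct io_uring_getevents_arg arg = {
                .ts = (unsigned long)&ts,       /* sigmask left unset */
        };
        int ret = syscall(__NR_io_uring_enter, ring_fd, 0, 1,
                          IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
                          &arg, sizeof(arg));

        return ret < 0 ? -errno : ret;
}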
io_uring: add sync cancelation API through io_uring_register()
The io_uring cancelation API is async, like any other API that we expose
there. For the case of finding a request to cancel, or not finding one,
it is fully sync in that when submission returns, the CQE for both the
cancelation request and the targeted request have been posted to the
CQ ring.
However, if the targeted work is being executed by io-wq, the API can
only start the act of canceling it. This makes it difficult to use in
some circumstances, as the caller then has to wait for the CQEs to come
in and match on the same cancelation data there.
Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register()
that always does sync cancelations. For the io-wq case, it'll wait
for the cancelation to come in before returning. The only expected
returns from this API are:

 0        Request found and canceled fine.
 > 0      Requests found and canceled. Only happens if asked to
          cancel multiple requests, and if the work wasn't in
          progress.
 -ENOENT  Request not found.
 -ETIME   A timeout on the operation was requested, but the timeout
          expired before we could cancel.

and we won't get -EALREADY via this API.
If both tv_sec and tv_nsec of the timeout passed in are -1, then no
timeout is requested. Otherwise, the timespec passed in is the amount
of time the sync cancel will wait for a successful cancelation.
Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-19 00:00:50 +08:00
/*
 * Argument for IORING_REGISTER_SYNC_CANCEL
 */
struct io_uring_sync_cancel_reg {
	__u64				addr;
	__s32				fd;
	__u32				flags;
	struct __kernel_timespec	timeout;
	__u64				pad[4];
};
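As a hedged sketch of the flow the commit message above describes, the helper
below synchronously cancels the request whose user_data matches, waiting up
to one second for io-wq to give it back. It assumes IORING_REGISTER_SYNC_CANCEL
from this header and the raw register syscall; note that with the syscall(2)
wrapper, results such as -ENOENT or -ETIME come back as -1 with errno set.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/time_types.h>
#include <linux/io_uring.h>

/* Synchronously cancel the request whose sqe->user_data equals 'user_data'. */
static int sync_cancel_user_data(int ring_fd, unsigned long long user_data)
{
	struct io_uring_sync_cancel_reg reg;

	memset(&reg, 0, sizeof(reg));	/* pad[] must be zero */
	reg.addr = user_data;		/* default match key is user_data */
	reg.fd = -1;			/* unused unless IORING_ASYNC_CANCEL_FD */
	reg.flags = 0;			/* IORING_ASYNC_CANCEL_* flags go here */
	reg.timeout.tv_sec = 1;		/* -1/-1 would mean "no timeout" */
	reg.timeout.tv_nsec = 0;

	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_SYNC_CANCEL, &reg, 1);
}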
2022-06-25 18:55:38 +08:00
/*
 * Argument for IORING_REGISTER_FILE_ALLOC_RANGE
 * The range is specified as [off, off + len)
 */
struct io_uring_file_index_range {
	__u32	off;
	__u32	len;
	__u64	resv;
};
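A hedged sketch of putting this to use: register an allocation range so that
fixed-file installs which ask for an automatically chosen slot only pick from
[off, off + len). It assumes IORING_REGISTER_FILE_ALLOC_RANGE from this
header and an already registered (possibly sparse) fixed file table.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* Constrain automatic fixed-file slot allocation to [off, off + len). */
static int set_file_alloc_range(int ring_fd, unsigned int off, unsigned int len)
{
	struct io_uring_file_index_range range;

	memset(&range, 0, sizeof(range));	/* resv must be zero */
	range.off = off;
	range.len = len;

	/* This register opcode takes the struct pointer and nr_args == 0. */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_FILE_ALLOC_RANGE, &range, 0);
}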
2022-07-14 19:02:58 +08:00
struct io_uring_recvmsg_out {
	__u32 namelen;
	__u32 controllen;
	__u32 payloadlen;
	__u32 flags;
};
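This header is written by the kernel at the start of the provided buffer for
a multishot IORING_OP_RECVMSG: namelen, controllen and payloadlen report how
much name, control and payload data there was (possibly more than fit), and
flags carries the msg_flags recvmsg(2) would have returned. A hedged sketch
of locating the payload, assuming the layout liburing uses (header, then a
name area sized by the request's original msg_namelen, then a control area
sized by its msg_controllen, then the payload):

#include <sys/socket.h>
#include <linux/io_uring.h>

/*
 * Given the filled buffer and the msghdr the recvmsg request was prepared
 * with, return the payload pointer and its length. Truncation shows up as
 * payloadlen exceeding the space left in the buffer (and MSG_TRUNC in flags).
 */
static void *recvmsg_payload(struct io_uring_recvmsg_out *out,
			     const struct msghdr *msgh,
			     unsigned int *payload_len)
{
	char *p = (char *)(out + 1);	/* name area starts right after the header */

	p += msgh->msg_namelen;		/* reserved name space, not out->namelen */
	p += msgh->msg_controllen;	/* reserved control space */

	*payload_len = out->payloadlen;
	return p;
}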
2022-08-23 19:45:49 +08:00
#ifdef __cplusplus
}
#endif
2019-01-08 01:46:33 +08:00
#endif