2020-03-21 19:19:07 +08:00
|
|
|
/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
/*
|
|
|
|
* Header file for the io_uring interface.
|
|
|
|
*
|
|
|
|
* Copyright (C) 2019 Jens Axboe
|
|
|
|
* Copyright (C) 2019 Christoph Hellwig
|
|
|
|
*/
|
|
|
|
#ifndef LINUX_IO_URING_H
|
|
|
|
#define LINUX_IO_URING_H
|
|
|
|
|
|
|
|
#include <linux/fs.h>
|
|
|
|
#include <linux/types.h>
|
|
|
|
|
|
|
|
/*
|
|
|
|
* IO submission data structure (Submission Queue Entry)
|
|
|
|
*/
|
|
|
|
struct io_uring_sqe {
|
|
|
|
__u8 opcode; /* type of operation for this sqe */
|
2019-01-11 13:13:58 +08:00
|
|
|
__u8 flags; /* IOSQE_ flags */
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
__u16 ioprio; /* ioprio for the request */
|
|
|
|
__s32 fd; /* file descriptor to do IO on */
|
2019-10-18 04:42:58 +08:00
|
|
|
union {
|
|
|
|
__u64 off; /* offset into file */
|
|
|
|
__u64 addr2;
|
|
|
|
};
|
2020-02-24 16:32:45 +08:00
|
|
|
union {
|
|
|
|
__u64 addr; /* pointer to buffer or iovecs */
|
|
|
|
__u64 splice_off_in;
|
|
|
|
};
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
__u32 len; /* buffer size or number of iovecs */
|
|
|
|
union {
|
|
|
|
__kernel_rwf_t rw_flags;
|
2019-01-12 00:43:02 +08:00
|
|
|
__u32 fsync_flags;
|
2020-06-17 17:53:55 +08:00
|
|
|
__u16 poll_events; /* compatibility */
|
|
|
|
__u32 poll32_events; /* word-reversed for BE */
|
2019-04-10 04:56:44 +08:00
|
|
|
__u32 sync_range_flags;
|
2019-04-20 03:34:07 +08:00
|
|
|
__u32 msg_flags;
|
2019-09-18 02:26:57 +08:00
|
|
|
__u32 timeout_flags;
|
2019-10-18 04:42:58 +08:00
|
|
|
__u32 accept_flags;
|
2019-10-29 11:49:21 +08:00
|
|
|
__u32 cancel_flags;
|
2019-12-12 02:20:36 +08:00
|
|
|
__u32 open_flags;
|
2019-12-14 12:18:10 +08:00
|
|
|
__u32 statx_flags;
|
2019-12-26 13:03:45 +08:00
|
|
|
__u32 fadvise_advice;
|
2020-02-24 16:32:45 +08:00
|
|
|
__u32 splice_flags;
|
2020-09-29 04:23:58 +08:00
|
|
|
__u32 rename_flags;
|
2020-09-29 04:27:37 +08:00
|
|
|
__u32 unlink_flags;
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
};
|
|
|
|
__u64 user_data; /* data to be passed back at completion time */
|
io_uring: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we
setup the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having setup an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-10 00:16:05 +08:00
|
|
|
union {
|
2020-01-29 01:15:23 +08:00
|
|
|
struct {
|
2020-02-24 07:41:33 +08:00
|
|
|
/* pack this to avoid bogus arm OABI complaints */
|
|
|
|
union {
|
|
|
|
/* index into fixed buffers, if used */
|
|
|
|
__u16 buf_index;
|
|
|
|
/* for grouped buffer selection */
|
|
|
|
__u16 buf_group;
|
|
|
|
} __attribute__((packed));
|
2020-01-29 01:15:23 +08:00
|
|
|
/* personality to use, if used */
|
|
|
|
__u16 personality;
|
2020-02-24 16:32:45 +08:00
|
|
|
__s32 splice_fd_in;
|
2020-01-29 01:15:23 +08:00
|
|
|
};
|
io_uring: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we
setup the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having setup an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-10 00:16:05 +08:00
|
|
|
__u64 __pad2[3];
|
|
|
|
};
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
};
|
|
|
|
|
2020-01-19 01:22:41 +08:00
|
|
|
enum {
|
|
|
|
IOSQE_FIXED_FILE_BIT,
|
|
|
|
IOSQE_IO_DRAIN_BIT,
|
|
|
|
IOSQE_IO_LINK_BIT,
|
|
|
|
IOSQE_IO_HARDLINK_BIT,
|
|
|
|
IOSQE_ASYNC_BIT,
|
io_uring: support buffer selection for OP_READ and OP_RECV
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.
Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-24 07:42:51 +08:00
|
|
|
IOSQE_BUFFER_SELECT_BIT,
|
2020-01-19 01:22:41 +08:00
|
|
|
};
|
|
|
|
|
2019-01-11 13:13:58 +08:00
|
|
|
/*
|
|
|
|
* sqe->flags
|
|
|
|
*/
|
2020-01-19 01:22:41 +08:00
|
|
|
/* use fixed fileset */
|
|
|
|
#define IOSQE_FIXED_FILE (1U << IOSQE_FIXED_FILE_BIT)
|
|
|
|
/* issue after inflight IO */
|
|
|
|
#define IOSQE_IO_DRAIN (1U << IOSQE_IO_DRAIN_BIT)
|
|
|
|
/* links next sqe */
|
|
|
|
#define IOSQE_IO_LINK (1U << IOSQE_IO_LINK_BIT)
|
|
|
|
/* like LINK, but stronger */
|
|
|
|
#define IOSQE_IO_HARDLINK (1U << IOSQE_IO_HARDLINK_BIT)
|
|
|
|
/* always go async */
|
|
|
|
#define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
|
io_uring: support buffer selection for OP_READ and OP_RECV
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.
Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-24 07:42:51 +08:00
|
|
|
/* select buffer from sqe->buf_group */
|
|
|
|
#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
|
2019-01-11 13:13:58 +08:00
|
|
|
|
2019-01-09 23:59:42 +08:00
|
|
|
/*
|
|
|
|
* io_uring_setup() flags
|
|
|
|
*/
|
|
|
|
#define IORING_SETUP_IOPOLL (1U << 0) /* io_context is polled */
|
io_uring: add submission polling
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can by changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard it's
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-11 02:22:30 +08:00
|
|
|
#define IORING_SETUP_SQPOLL (1U << 1) /* SQ poll thread */
|
|
|
|
#define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */
|
2019-10-05 02:10:03 +08:00
|
|
|
#define IORING_SETUP_CQSIZE (1U << 3) /* app defines CQ size */
|
2019-12-29 06:39:54 +08:00
|
|
|
#define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */
|
2020-01-28 08:15:48 +08:00
|
|
|
#define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */
|
2020-08-27 22:58:31 +08:00
|
|
|
#define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */
|
2019-01-09 23:59:42 +08:00
|
|
|
|
2019-12-12 06:55:43 +08:00
|
|
|
enum {
|
|
|
|
IORING_OP_NOP,
|
|
|
|
IORING_OP_READV,
|
|
|
|
IORING_OP_WRITEV,
|
|
|
|
IORING_OP_FSYNC,
|
|
|
|
IORING_OP_READ_FIXED,
|
|
|
|
IORING_OP_WRITE_FIXED,
|
|
|
|
IORING_OP_POLL_ADD,
|
|
|
|
IORING_OP_POLL_REMOVE,
|
|
|
|
IORING_OP_SYNC_FILE_RANGE,
|
|
|
|
IORING_OP_SENDMSG,
|
|
|
|
IORING_OP_RECVMSG,
|
|
|
|
IORING_OP_TIMEOUT,
|
|
|
|
IORING_OP_TIMEOUT_REMOVE,
|
|
|
|
IORING_OP_ACCEPT,
|
|
|
|
IORING_OP_ASYNC_CANCEL,
|
|
|
|
IORING_OP_LINK_TIMEOUT,
|
|
|
|
IORING_OP_CONNECT,
|
2019-12-11 01:38:56 +08:00
|
|
|
IORING_OP_FALLOCATE,
|
2019-12-12 02:20:36 +08:00
|
|
|
IORING_OP_OPENAT,
|
2019-12-12 05:02:38 +08:00
|
|
|
IORING_OP_CLOSE,
|
2019-12-10 02:22:50 +08:00
|
|
|
IORING_OP_FILES_UPDATE,
|
2019-12-14 12:18:10 +08:00
|
|
|
IORING_OP_STATX,
|
2019-12-23 06:19:35 +08:00
|
|
|
IORING_OP_READ,
|
|
|
|
IORING_OP_WRITE,
|
2019-12-26 13:03:45 +08:00
|
|
|
IORING_OP_FADVISE,
|
2019-12-26 13:18:28 +08:00
|
|
|
IORING_OP_MADVISE,
|
2020-01-05 11:19:44 +08:00
|
|
|
IORING_OP_SEND,
|
|
|
|
IORING_OP_RECV,
|
2020-01-09 08:59:24 +08:00
|
|
|
IORING_OP_OPENAT2,
|
2020-01-09 06:18:09 +08:00
|
|
|
IORING_OP_EPOLL_CTL,
|
2020-02-24 16:32:45 +08:00
|
|
|
IORING_OP_SPLICE,
|
2020-02-24 07:41:33 +08:00
|
|
|
IORING_OP_PROVIDE_BUFFERS,
|
2020-03-03 07:32:28 +08:00
|
|
|
IORING_OP_REMOVE_BUFFERS,
|
2020-05-17 19:18:06 +08:00
|
|
|
IORING_OP_TEE,
|
2020-09-06 01:14:22 +08:00
|
|
|
IORING_OP_SHUTDOWN,
|
2020-09-29 04:23:58 +08:00
|
|
|
IORING_OP_RENAMEAT,
|
2020-09-29 04:27:37 +08:00
|
|
|
IORING_OP_UNLINKAT,
|
2019-12-12 06:55:43 +08:00
|
|
|
|
|
|
|
/* this goes last, obviously */
|
|
|
|
IORING_OP_LAST,
|
|
|
|
};
|
2019-01-12 00:43:02 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* sqe->fsync_flags
|
|
|
|
*/
|
|
|
|
#define IORING_FSYNC_DATASYNC (1U << 0)
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
|
2019-10-16 06:48:15 +08:00
|
|
|
/*
|
|
|
|
* sqe->timeout_flags
|
|
|
|
*/
|
|
|
|
#define IORING_TIMEOUT_ABS (1U << 0)
|
2020-12-01 03:11:16 +08:00
|
|
|
#define IORING_TIMEOUT_UPDATE (1U << 1)
|
2019-10-16 06:48:15 +08:00
|
|
|
|
2020-02-24 16:32:45 +08:00
|
|
|
/*
|
|
|
|
* sqe->splice_flags
|
|
|
|
* extends splice(2) flags
|
|
|
|
*/
|
|
|
|
#define SPLICE_F_FD_IN_FIXED (1U << 31) /* the last bit of __u32 */
|
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
/*
|
|
|
|
* IO completion data structure (Completion Queue Entry)
|
|
|
|
*/
|
|
|
|
struct io_uring_cqe {
|
|
|
|
__u64 user_data; /* sqe->data submission passed back */
|
|
|
|
__s32 res; /* result code for this event */
|
|
|
|
__u32 flags;
|
|
|
|
};
|
|
|
|
|
io_uring: support buffer selection for OP_READ and OP_RECV
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.
Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-24 07:42:51 +08:00
|
|
|
/*
|
|
|
|
* cqe->flags
|
|
|
|
*
|
|
|
|
* IORING_CQE_F_BUFFER If set, the upper 16 bits are the buffer ID
|
|
|
|
*/
|
|
|
|
#define IORING_CQE_F_BUFFER (1U << 0)
|
|
|
|
|
|
|
|
enum {
|
|
|
|
IORING_CQE_BUFFER_SHIFT = 16,
|
|
|
|
};
|
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
/*
|
|
|
|
* Magic offsets for the application to mmap the data it needs
|
|
|
|
*/
|
|
|
|
#define IORING_OFF_SQ_RING 0ULL
|
|
|
|
#define IORING_OFF_CQ_RING 0x8000000ULL
|
|
|
|
#define IORING_OFF_SQES 0x10000000ULL
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Filled with the offset for mmap(2)
|
|
|
|
*/
|
|
|
|
struct io_sqring_offsets {
|
|
|
|
__u32 head;
|
|
|
|
__u32 tail;
|
|
|
|
__u32 ring_mask;
|
|
|
|
__u32 ring_entries;
|
|
|
|
__u32 flags;
|
|
|
|
__u32 dropped;
|
|
|
|
__u32 array;
|
|
|
|
__u32 resv1;
|
|
|
|
__u64 resv2;
|
|
|
|
};
|
|
|
|
|
io_uring: add submission polling
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can by changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard it's
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-11 02:22:30 +08:00
|
|
|
/*
|
|
|
|
* sq_ring->flags
|
|
|
|
*/
|
|
|
|
#define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */
|
io_uring: export cq overflow status to userspace
For those applications which are not willing to use io_uring_enter()
to reap and handle cqes, they may completely rely on liburing's
io_uring_peek_cqe(), but if cq ring has overflowed, currently because
io_uring_peek_cqe() is not aware of this overflow, it won't enter
kernel to flush cqes, below test program can reveal this bug:
static void test_cq_overflow(struct io_uring *ring)
{
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe;
int issued = 0;
int ret = 0;
do {
sqe = io_uring_get_sqe(ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
break;;
}
ret = io_uring_submit(ring);
if (ret <= 0) {
if (ret != -EBUSY)
fprintf(stderr, "sqe submit failed: %d\n", ret);
break;
}
issued++;
} while (ret > 0);
assert(ret == -EBUSY);
printf("issued requests: %d\n", issued);
while (issued) {
ret = io_uring_peek_cqe(ring, &cqe);
if (ret) {
if (ret != -EAGAIN) {
fprintf(stderr, "peek completion failed: %s\n",
strerror(ret));
break;
}
printf("left requets: %d\n", issued);
continue;
}
io_uring_cqe_seen(ring, cqe);
issued--;
printf("left requets: %d\n", issued);
}
}
int main(int argc, char *argv[])
{
int ret;
struct io_uring ring;
ret = io_uring_queue_init(16, &ring, 0);
if (ret) {
fprintf(stderr, "ring setup failed: %d\n", ret);
return 1;
}
test_cq_overflow(&ring);
return 0;
}
To fix this issue, export cq overflow status to userspace by adding new
IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-09 09:15:29 +08:00
|
|
|
#define IORING_SQ_CQ_OVERFLOW (1U << 1) /* CQ ring is overflown */
|
io_uring: add submission polling
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can by changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard it's
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-11 02:22:30 +08:00
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
struct io_cqring_offsets {
|
|
|
|
__u32 head;
|
|
|
|
__u32 tail;
|
|
|
|
__u32 ring_mask;
|
|
|
|
__u32 ring_entries;
|
|
|
|
__u32 overflow;
|
|
|
|
__u32 cqes;
|
2020-05-16 00:38:04 +08:00
|
|
|
__u32 flags;
|
|
|
|
__u32 resv1;
|
|
|
|
__u64 resv2;
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
};
|
|
|
|
|
2020-05-16 00:38:05 +08:00
|
|
|
/*
|
|
|
|
* cq_ring->flags
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* disable eventfd notifications */
|
|
|
|
#define IORING_CQ_EVENTFD_DISABLED (1U << 0)
|
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
/*
|
|
|
|
* io_uring_enter(2) flags
|
|
|
|
*/
|
|
|
|
#define IORING_ENTER_GETEVENTS (1U << 0)
|
io_uring: add submission polling
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can by changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard it's
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-11 02:22:30 +08:00
|
|
|
#define IORING_ENTER_SQ_WAKEUP (1U << 1)
|
2020-09-04 02:12:41 +08:00
|
|
|
#define IORING_ENTER_SQ_WAIT (1U << 2)
|
2020-11-03 10:54:37 +08:00
|
|
|
#define IORING_ENTER_EXT_ARG (1U << 3)
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Passed in for io_uring_setup(2). Copied back with updated info on success
|
|
|
|
*/
|
|
|
|
struct io_uring_params {
|
|
|
|
__u32 sq_entries;
|
|
|
|
__u32 cq_entries;
|
|
|
|
__u32 flags;
|
io_uring: add submission polling
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can by changed
by passing in a different grace period at io_uring_register(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard it's
io_uring_enter(2) call with:
read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-11 02:22:30 +08:00
|
|
|
__u32 sq_thread_cpu;
|
|
|
|
__u32 sq_thread_idle;
|
2019-09-07 00:26:21 +08:00
|
|
|
__u32 features;
|
2020-01-28 08:15:48 +08:00
|
|
|
__u32 wq_fd;
|
|
|
|
__u32 resv[3];
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
struct io_sqring_offsets sq_off;
|
|
|
|
struct io_cqring_offsets cq_off;
|
|
|
|
};
|
|
|
|
|
2019-09-07 00:26:21 +08:00
|
|
|
/*
|
|
|
|
* io_uring_params->features flags
|
|
|
|
*/
|
|
|
|
#define IORING_FEAT_SINGLE_MMAP (1U << 0)
|
io_uring: add support for backlogged CQ ring
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.
After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.
Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 02:31:17 +08:00
|
|
|
#define IORING_FEAT_NODROP (1U << 1)
|
2019-12-03 09:51:26 +08:00
|
|
|
#define IORING_FEAT_SUBMIT_STABLE (1U << 2)
|
2019-12-26 07:33:42 +08:00
|
|
|
#define IORING_FEAT_RW_CUR_POS (1U << 3)
|
2020-01-28 07:34:48 +08:00
|
|
|
#define IORING_FEAT_CUR_PERSONALITY (1U << 4)
|
io_uring: use poll driven retry for files that support it
Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.
This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just
sit and block, waiting for the IO to complete.
The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll<link>other_op is fast as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-15 13:23:12 +08:00
|
|
|
#define IORING_FEAT_FAST_POLL (1U << 5)
|
2020-06-17 17:53:55 +08:00
|
|
|
#define IORING_FEAT_POLL_32BITS (1U << 6)
|
2020-09-15 00:51:17 +08:00
|
|
|
#define IORING_FEAT_SQPOLL_NONFIXED (1U << 7)
|
2020-11-03 10:54:37 +08:00
|
|
|
#define IORING_FEAT_EXT_ARG (1U << 8)
|
2021-02-21 02:55:28 +08:00
|
|
|
#define IORING_FEAT_NATIVE_WORKERS (1U << 9)
|
2019-09-07 00:26:21 +08:00
|
|
|
|
io_uring: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we
setup the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having setup an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-10 00:16:05 +08:00
|
|
|
/*
|
|
|
|
* io_uring_register(2) opcodes and arguments
|
|
|
|
*/
|
2020-08-27 22:58:29 +08:00
|
|
|
enum {
|
|
|
|
IORING_REGISTER_BUFFERS = 0,
|
|
|
|
IORING_UNREGISTER_BUFFERS = 1,
|
|
|
|
IORING_REGISTER_FILES = 2,
|
|
|
|
IORING_UNREGISTER_FILES = 3,
|
|
|
|
IORING_REGISTER_EVENTFD = 4,
|
|
|
|
IORING_UNREGISTER_EVENTFD = 5,
|
|
|
|
IORING_REGISTER_FILES_UPDATE = 6,
|
|
|
|
IORING_REGISTER_EVENTFD_ASYNC = 7,
|
|
|
|
IORING_REGISTER_PROBE = 8,
|
|
|
|
IORING_REGISTER_PERSONALITY = 9,
|
|
|
|
IORING_UNREGISTER_PERSONALITY = 10,
|
2020-08-27 22:58:30 +08:00
|
|
|
IORING_REGISTER_RESTRICTIONS = 11,
|
2020-08-27 22:58:31 +08:00
|
|
|
IORING_REGISTER_ENABLE_RINGS = 12,
|
2020-08-27 22:58:29 +08:00
|
|
|
|
|
|
|
/* this goes last */
|
|
|
|
IORING_REGISTER_LAST
|
|
|
|
};
|
2019-10-04 03:59:56 +08:00
|
|
|
|
2021-01-16 01:37:44 +08:00
|
|
|
/* deprecated, see struct io_uring_rsrc_update */
|
2019-10-04 03:59:56 +08:00
|
|
|
struct io_uring_files_update {
|
|
|
|
__u32 offset;
|
2020-01-16 00:35:38 +08:00
|
|
|
__u32 resv;
|
|
|
|
__aligned_u64 /* __s32 * */ fds;
|
2019-10-04 03:59:56 +08:00
|
|
|
};
|
io_uring: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we
setup the io_uring. That avoids the need to do get_user_pages() for
each and every IO.
To utilize this feature, the application must call io_uring_register()
after having setup an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-10 00:16:05 +08:00
|
|
|
|
2021-01-16 01:37:44 +08:00
|
|
|
struct io_uring_rsrc_update {
|
|
|
|
__u32 offset;
|
|
|
|
__u32 resv;
|
|
|
|
__aligned_u64 data;
|
|
|
|
};
|
|
|
|
|
2021-01-27 04:23:28 +08:00
|
|
|
/* Skip updating fd indexes set to this value in the fd table */
|
|
|
|
#define IORING_REGISTER_FILES_SKIP (-2)
|
|
|
|
|
2020-01-17 06:36:52 +08:00
|
|
|
#define IO_URING_OP_SUPPORTED (1U << 0)
|
|
|
|
|
|
|
|
struct io_uring_probe_op {
|
|
|
|
__u8 op;
|
|
|
|
__u8 resv;
|
|
|
|
__u16 flags; /* IO_URING_OP_* flags */
|
|
|
|
__u32 resv2;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct io_uring_probe {
|
|
|
|
__u8 last_op; /* last opcode supported */
|
|
|
|
__u8 ops_len; /* length of ops[] array below */
|
|
|
|
__u16 resv;
|
|
|
|
__u32 resv2[3];
|
|
|
|
struct io_uring_probe_op ops[0];
|
|
|
|
};
|
|
|
|
|
2020-08-27 22:58:30 +08:00
|
|
|
struct io_uring_restriction {
|
|
|
|
__u16 opcode;
|
|
|
|
union {
|
|
|
|
__u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
|
|
|
|
__u8 sqe_op; /* IORING_RESTRICTION_SQE_OP */
|
|
|
|
__u8 sqe_flags; /* IORING_RESTRICTION_SQE_FLAGS_* */
|
|
|
|
};
|
|
|
|
__u8 resv;
|
|
|
|
__u32 resv2[3];
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* io_uring_restriction->opcode values
|
|
|
|
*/
|
|
|
|
enum {
|
|
|
|
/* Allow an io_uring_register(2) opcode */
|
|
|
|
IORING_RESTRICTION_REGISTER_OP = 0,
|
|
|
|
|
|
|
|
/* Allow an sqe opcode */
|
|
|
|
IORING_RESTRICTION_SQE_OP = 1,
|
|
|
|
|
|
|
|
/* Allow sqe flags */
|
|
|
|
IORING_RESTRICTION_SQE_FLAGS_ALLOWED = 2,
|
|
|
|
|
|
|
|
/* Require sqe flags (these flags must be set on each submission) */
|
|
|
|
IORING_RESTRICTION_SQE_FLAGS_REQUIRED = 3,
|
|
|
|
|
|
|
|
IORING_RESTRICTION_LAST
|
|
|
|
};
|
|
|
|
|
2020-11-03 10:54:37 +08:00
|
|
|
struct io_uring_getevents_arg {
|
|
|
|
__u64 sigmask;
|
|
|
|
__u32 sigmask_sz;
|
|
|
|
__u32 pad;
|
|
|
|
__u64 ts;
|
|
|
|
};
|
|
|
|
|
Add io_uring IO interface
The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.
IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
Two new system calls are added for this:
io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.
io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.
For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-08 01:46:33 +08:00
|
|
|
#endif
|