If IORING_FILE_INDEX_ALLOC is set, asking for an allocated slot, the
helper doesn't check whether we actually have a file table. The
non-alloc path does that correctly, and returns -ENXIO if we haven't
set one up.
Do the same for the allocated path, avoiding a NULL pointer dereference
when trying to find a free bit.
Fixes: a7c41b4687 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
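A minimal userspace model of the shape of the fix (the struct and
function names below are illustrative stand-ins, not the actual kernel
symbols): bail out with -ENXIO before scanning for a free bit when no
table has been registered.

#include <errno.h>
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

struct file_table {
    unsigned long *bitmap;      /* one bit per fixed-file slot */
    unsigned int nr_slots;
};

static int file_bitmap_get(const struct file_table *table)
{
    unsigned int i;

    /* the missing check: no file table registered yet */
    if (!table || !table->bitmap)
        return -ENXIO;

    for (i = 0; i < table->nr_slots; i++) {
        if (!(table->bitmap[i / BITS_PER_LONG] &
              (1UL << (i % BITS_PER_LONG))))
            return i;   /* first free bit */
    }
    return -ENFILE;
}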
The 32-bit sqe->cmd_op is in a union with 64-bit values. It's always a
good idea to do the padding explicitly. Also zero-check it in prep, so
it can be used in the future if needed without compatibility concerns.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6b95a05e970af79000435166185e85b196b2ba2.1657202417.git.asml.silence@gmail.com
[axboe: turn bitwise OR into logical variant]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
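A standalone sketch of the layout issue (the struct below is
illustrative, not the actual uapi definition): a 32-bit field that
overlaps 64-bit ones leaves 4 bytes whose contents the kernel otherwise
can't rely on, so name the padding and reject non-zero values at prep
time.

#include <errno.h>
#include <stdint.h>

struct sqe_like {
    union {
        uint64_t addr2;          /* 64-bit users of the union */
        struct {
            uint32_t cmd_op;
            uint32_t __pad1;     /* padding made explicit */
        };
    };
};

static int prep_check(const struct sqe_like *sqe)
{
    return sqe->__pad1 ? -EINVAL : 0;   /* must be zero for now */
}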
io_import_iovec() uses the s pointer, but this was changed immediately
after the iovec was re-imported, and so it was imported into the wrong
place.
Change the ordering.
Fixes: 2be2eb02e2 ("io_uring: ensure reads re-import for selected buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630132006.2825668-1-dylany@fb.com
[axboe: ensure we don't half-import as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We waste a u64 SQE field for flags, even though we don't need that many
bits and the field could be used for something more useful later. Store
io_uring specific send/recv flags in sqe->ioprio instead of ->addr2.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 0455d4ccec ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg")
[axboe: change comment in io_uring.h as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
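A hedged liburing sketch of what the uapi looks like after this change
(assuming 5.19-level headers): the io_uring specific send/recv flag now
goes in sqe->ioprio.

#include <liburing.h>
#include <stddef.h>

static void queue_recv_poll_first(struct io_uring *ring, int sockfd,
                                  void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    /* io_uring specific send/recv flags live in ->ioprio now */
    sqe->ioprio = IORING_RECVSEND_POLL_FIRST;
}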
In prior kernels, we always did file assignment at prep time. This
meant that req->task == current. But after deferring that assignment
and then pushing the inflight tracking back in, we've got the inflight
tracking using current when it should in fact now be using req->task.
Fix up the error introduced by adding the inflight tracking back after
file assignment got modified.
Fixes: 9cae36a094 ("io_uring: reinstate the inflight tracking")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have re-polling for partial IO, so a request can be polled twice. If
it used two poll entries the first time, then on the second
io_arm_poll_handler() it will find the old apoll entry and NULL the
kmalloc()'ed second entry, i.e. apoll->double_poll, leaking it.
Fixes: 10c873334f ("io_uring: allow re-poll if we made progress")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fee2452494222ecc7f1f88c8fb659baef971414a.1655852245.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
apoll_events should be set once at the beginning of poll arming, just
like poll->events, and not change afterwards. However, io_uring
currently resets it on each __io_poll_execute() for no clear reason.
There is also a place in __io_arm_poll_handler() where we add
EPOLLONESHOT to downgrade a multishot, but forget to do the same thing
with ->apoll_events, which is buggy.
Fixes: 81459350d5 ("io_uring: cache req->apoll->events in req->cflags")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/0aef40399ba75b1a4d2c2e85e6e8fd93c02fc6e4.1655814213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the dropping of the IOPOLL checking in the per-opcode handlers, we
inadvertently left two checks in the recv/recvmsg and send/sendmsg prep
handlers for the same thing, and one of them includes addr2, which
holds the flags for these opcodes.
Fix it up and kill the redundant checks.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we mark for reissue, we assume that the buffer will remain stable.
Hence if we are using a provided buffer, we need to ensure that we
stick with it for the duration of that request.
This only affects block devices that use provided buffers, as those are
the only ones that get marked with REQ_F_REISSUE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_arm_poll_handler() will recycle the buffer appropriately if we end
up arming poll (or if we're ready to retry), but not for the io-wq case
if we have attempted poll first.
Explicitly recycle the buffer, both to avoid hanging on to it for too
long and to avoid multiple reads grabbing the same one. This can happen
with ring mapped buffers, since the buffer hasn't necessarily been
committed yet.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Link: https://github.com/axboe/liburing/issues/605
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_task_prio_work_add() has a strict assumption that it will only
be used with io_req_task_complete(). There is a codepath that assumes
this is the case and will not even call the completion function if it
is hit.
For uring_cmd, which has an arbitrary completion function, change the
call to the correct non-priority version.
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220616135011.441980-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For recv/recvmsg, IO either completes immediately or gets queued for a
retry. This isn't the case for read/readv, if e.g. a normal file or a
block device is used. Here, an operation can get queued with the block
layer. If this happens, ring mapped buffers must get committed
immediately to prevent the next read from consuming the same buffer.
Check if we're dealing with a pollable file when getting a new ring
mapped provided buffer. If it's not, commit it immediately rather than
waiting until post issue. If we don't, we can race with completions
coming in, or see plain buffer reuse when committing after a retry,
where others could have grabbed the same buffer.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't really know the state of the req->extra{1,2} fields in
__io_fill_cqe_req(); if an opcode handler is not aware of the CQE32
option, it never sets them up properly. Track the state of those fields
with a request flag.
Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4b3e5be512fbf4debec7270fd485b8a3b014d464.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only user of the io_req_complete32()-like functions is cmd
requests. Instead of keeping the whole complete32 family, remove them
and pass the extra values via the req->extra{1,2} fields already added
for inline completions. When fill_cqe_res() finds the CQE32 option
enabled, it'll use those fields to fill a 32B cqe.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/af1319eb661b1f9a0abceb51cbbf72b8002e019d.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want just one function that will handle both normal CQEs and 32B
CQEs. Combine __io_fill_cqe_req() and __io_fill_cqe_req32(). It's still
not entirely correct yet, but it saves us from cases where we fill a
CQE of the wrong size.
Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8085c5b2f74141520f60decd45334f87e389b718.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This partially reverts a7c41b4687
Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some
users, it tries to do two things at once, and it's not clear how to
handle errors and what to return in a single result field when one part
fails and the other completes well. Kill it for now.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/837c745019b3795941eee4fcfd7de697886d645b.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring fixes from Pavel.
* 'io_uring/io_uring-5.19' of https://github.com/isilence/linux:
io_uring: fix double unlock for pbuf select
io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
io_uring: openclose: fix bug of closing wrong fixed file
io_uring: fix not locked access to fixed buf table
io_uring: fix races with buffer table unregister
io_uring: fix races with file table unregister
The types of head and tail do not allow more than 2^15 entries in a
provided buffer ring, so do not allow this.
At 2^16 entries, while each entry can still be indexed, there is no way
to disambiguate full vs empty.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-4-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
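A standalone demonstration of why 2^16 entries cannot work with 16-bit
indices (plain userspace C, not kernel code): once tail has advanced a
full 2^16 past head, the ring is indistinguishable from empty.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t head = 0, tail = 0;

    printf("empty:       tail - head = %u\n", (uint16_t)(tail - head));
    tail += 65535;  /* 2^16 - 1 entries: still representable */
    printf("nearly full: tail - head = %u\n", (uint16_t)(tail - head));
    tail += 1;      /* 2^16 entries: wraps, looks empty again */
    printf("\"full\":      tail - head = %u\n", (uint16_t)(tail - head));
    return 0;
}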
The type of head needs to match that of tail in order for rollover and
comparisons to work correctly.
Without this change, the comparison of tail to head might incorrectly
allow io_uring to use a buffer that userspace had not given it.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-3-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
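A standalone illustration of the mismatch (userspace C with
illustrative variable names, not the kernel source): if head is tracked
in a wider type than the 16-bit tail, the available-entries computation
goes wrong as soon as tail wraps.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t head32 = 65536;           /* 2^16 buffers consumed, no wrap */
    uint16_t head16 = 65536 & 0xffff;  /* same count, in tail's type */
    uint16_t tail = 0;                 /* producer's tail has wrapped */

    /* mixed widths: bogus huge count, i.e. buffers never provided */
    printf("mismatched: %u available\n", tail - head32);
    /* matching 16-bit math correctly reports an empty ring */
    printf("matched:    %u available\n", (uint16_t)(tail - head16));
    return 0;
}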
When indexing into a provided buffer ring, do not subtract 1 from the
index.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-2-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
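A short sketch of the corrected indexing (illustrative, not the kernel
source): the entry is picked at head masked by (entries - 1);
subtracting 1 first would hand back the previous slot's buffer.

#include <stdint.h>

struct buf_entry { uint64_t addr; uint32_t len; uint16_t bid; };

static struct buf_entry *pick(struct buf_entry *ring, uint16_t head,
                              uint16_t entries) /* power-of-2 count */
{
    return &ring[head & (entries - 1)];  /* not (head - 1) & mask */
}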
io_buffer_select(), which is the only caller of io_ring_buffer_select(),
fully handles locking, so the mutex unlock in io_ring_buffer_select()
will lead to a double unlock.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
When we use a ring-mapped provided buffer, we should consume it before
arming poll if partial IO has been done. Otherwise the buffer may be
used by other requests and the data is lost.
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Hao Xu <howeyxu@tencent.com>
[pavel: 5.19 rebase]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Don't update ret until the fixed file is closed; otherwise the file
slot becomes the error code.
Fixes: a7c41b4687 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots")
Signed-off-by: Hao Xu <howeyxu@tencent.com>
[pavel: 5.19 rebase]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
We can look inside the fixed buffer table only while holding
->uring_lock; however, in some cases we don't do the right async prep
for IORING_OP_{WRITE,READ}_FIXED, ending up with a NULL req->imu and
forcing an io-wq worker to try to resolve the fixed buffer without
proper locking.
Move req->imu setup into the early req init path, i.e. io_prep_rw(),
which is called unconditionally for rw requests and under uring_lock.
Fixes: 634d00df5e ("io_uring: add full-fledged dynamic buffers support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixed buffer table quiesce might unlock ->uring_lock, potentially
letting new requests be submitted. Don't allow those requests to use
the table, as they will race with unregistration.
Reported-and-tested-by: van fantasy <g1042620637@gmail.com>
Fixes: bd54b6fe33 ("io_uring: implement fixed buffers registration similar to fixed files")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixed file table quiesce might unlock ->uring_lock, potentially letting
new requests be submitted. Don't allow those requests to use the table,
as they will race with unregistration.
Reported-and-tested-by: van fantasy <g1042620637@gmail.com>
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Merge tag 'pull-work.fd-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull file descriptor fix from Al Viro:
"Fix for breakage in #work.fd this window"
* tag 'pull-work.fd-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the breakage in close_fd_get_file() calling conventions change
It used to grab an extra reference to struct file rather than just
transferring to the caller the one it had removed from the descriptor
table. The new variant doesn't, and callers need to be adjusted.
Reported-and-tested-by: syzbot+47dd250f527cb7bebf24@syzkaller.appspotmail.com
Fixes: 6319194ec5 ("Unify the primitives for file descriptor closing")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After some debugging, it was realized that we really do still need the
old inflight tracking for any file type that has io_uring_fops
assigned. If we don't have it, trivial circular references mean that we
never get the ctx cleaned up, and hence it leaks.
Just bring back the inflight tracking, which then also means we can
eliminate the conditional dropping of the file when task_work is queued.
Fixes: d5361233e9 ("io_uring: drop the old style inflight file tracking")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_fixed_fd_install() can grab uring_lock in the slot allocation path
when called from io-wq, and then call into io_install_fixed_file(),
which will lock it again. Pull all locking out of
io_install_fixed_file() into io_fixed_fd_install().
Fixes: 1339f24b33 ("io_uring: allow allocated fixed files for openat/openat2")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/64116172a9d0b85b85300346bb280f3657aafc26.1654087283.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
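The general shape of the fix, modeled in userspace with pthreads
(install()/install_locked() are illustrative names, not the kernel
functions): the callee never touches the lock; the single entry point
takes it once, so no path can lock it twice.

#include <pthread.h>

static pthread_mutex_t uring_lock = PTHREAD_MUTEX_INITIALIZER;

static int install_locked(int fd, unsigned int slot)
{
    /* caller holds uring_lock; the table update itself is elided */
    (void)fd;
    (void)slot;
    return 0;
}

static int install(int fd, unsigned int slot)
{
    int ret;

    pthread_mutex_lock(&uring_lock);
    ret = install_locked(fd, slot);
    pthread_mutex_unlock(&uring_lock);
    return ret;
}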
One big issue with the file registration feature is that it needs user
space apps to maintain free slot info about io_uring's fixed file
table, which really is a burden for development. io_uring now supports
choosing a free file slot for user space apps by using the
IORING_FILE_INDEX_ALLOC flag in accept, open, and socket operations,
but those need the app to use direct accept or direct open, which not
all apps are prepared to use yet.
To support apps that still need real fds, make the registration feature
easier to use. Let IORING_OP_FILES_UPDATE support choosing fixed file
slots, which will store the picked fixed file slots in the fd array and
let the cqe return the number of slots allocated.
Suggested-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: move flag to uapi io_uring header, change goto to break, init]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
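A hedged liburing sketch of the new flow (assuming 5.19-level headers;
error handling elided): real fds go in, the kernel picks the slots, and
the picked slots come back in the same array.

#include <liburing.h>

static int files_update_alloc(struct io_uring *ring, int *fds,
                              unsigned int nr)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct io_uring_cqe *cqe;
    int allocated;

    /* IORING_FILE_INDEX_ALLOC as the offset requests free slots */
    io_uring_prep_files_update(sqe, fds, nr, IORING_FILE_INDEX_ALLOC);
    io_uring_submit(ring);
    io_uring_wait_cqe(ring, &cqe);
    allocated = cqe->res;   /* slots picked; fds[] now holds them */
    io_uring_cqe_seen(ring, cqe);
    return allocated;
}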
io_file_bitmap_get() returns a free bitmap slot, but if it isn't used
later, e.g. because io_queue_rsrc_removal() returns an error, we should
not update alloc_hint at all; the slot should still be considered a
valid candidate for subsequent io_file_bitmap_get() calls.
To fix this issue, only update alloc_hint in io_file_bitmap_set().
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220528015109.48039-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
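A userspace model of the hint policy (names are illustrative, not the
kernel symbols): the lookup starts from the hint but leaves it
untouched; only a committed set advances it, so an abandoned candidate
remains up for grabs.

struct slot_bitmap {
    unsigned long words[4];     /* 256 slots, assumes 64-bit long */
    unsigned int nbits;
    unsigned int alloc_hint;
};

static int slot_get(const struct slot_bitmap *b)    /* no hint update */
{
    unsigned int i;

    for (i = 0; i < b->nbits; i++) {
        unsigned int bit = (b->alloc_hint + i) % b->nbits;

        if (!(b->words[bit / 64] & (1UL << (bit % 64))))
            return bit;
    }
    return -1;
}

static void slot_set(struct slot_bitmap *b, unsigned int bit)
{
    b->words[bit / 64] |= 1UL << (bit % 64);
    b->alloc_hint = bit + 1;    /* advance only on a real set */
}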
The socket support was merged in an earlier branch that didn't yet
have support for allocating direct descriptors, hence only open
and accept got support for that.
Do the one-liner to enable it now, so we have consistent support for
any request that can instantiate a file/direct descriptor.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we use a buffer group ID that is large enough to require io_uring
to allocate it, then we don't correctly free it if the cleanup is
deferred to the ring exit. The explicit removal paths are fine.
Fixes: 9cfc7e94e4 ("io_uring: get rid of hashed provided buffer groups")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make them consistent in preparation for defining a req async prep
handler. The readv/writev requests share a prep handler, move it one
level down so the initial one is consistent with the others.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Define and set it when appropriate, and use it consistently in the
function rather than using io_op_defs[opcode].
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Almost all of them are; the odd ones out are the poll remove and files
update requests. Name them like the others, which is:
io_#cmdname_prep for request preparation
io_#cmdname for request issue
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All other opcodes take a {req, sqe} set for prep handling, split out
a timeout prep handler so that timeout and linked timeouts can use
the same one.
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block
Pull io_uring NVMe command passthrough from Jens Axboe:
"On top of everything else, this adds support for passthrough for
io_uring.
The initial feature for this is NVMe passthrough support, which allows
non-filesystem based IO commands and admin commands.
To support this, io_uring grows support for SQE and CQE members that
are twice as big, allowing a full NVMe command to be passed in without
having to copy data around, and allowing completions with more than
just a single 32-bit value as the output"
* tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block: (22 commits)
io_uring: cleanup handling of the two task_work lists
nvme: enable uring-passthrough for admin commands
nvme: helper for uring-passthrough checks
blk-mq: fix passthrough plugging
nvme: add vectored-io support for uring-cmd
nvme: wire-up uring-cmd support for io-passthru on char-device.
nvme: refactor nvme_submit_user_cmd()
block: wire-up support for passthrough plugging
fs,io_uring: add infrastructure for uring-cmd
io_uring: support CQE32 for nop operation
io_uring: enable CQE32
io_uring: support CQE32 in /proc info
io_uring: add tracing for additional CQE32 fields
io_uring: overflow processing for CQE32
io_uring: flush completions for CQE32
io_uring: modify io_get_cqe for CQE32
io_uring: add CQE32 completion processing
io_uring: add CQE32 setup processing
io_uring: change ring size calculation for CQE32
io_uring: store add. return values for CQE32
...
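A hedged liburing sketch of opting in to the big formats the
passthrough support relies on (IORING_SETUP_SQE128 and
IORING_SETUP_CQE32 per the 5.19 uapi):

#include <liburing.h>

static int setup_passthru_ring(struct io_uring *ring)
{
    struct io_uring_params p = { 0 };

    /* 128-byte SQEs and 32-byte CQEs for uring-cmd passthrough */
    p.flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;
    return io_uring_queue_init_params(64, ring, &p);
}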
Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block
Pull io_uring 'more data in socket' support from Jens Axboe:
"To be able to fully utilize the 'poll first' support in the core
io_uring branch, it's advantageous knowing if the socket was empty
after a receive. This adds support for that"
* tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block:
io_uring: return hint on whether more data is available after receive
tcp: pass back data left in socket after receive
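A hedged sketch of consuming that hint from userspace
(IORING_CQE_F_SOCK_NONEMPTY per the 5.19 uapi): if the flag is set
after a successful recv, another recv can be issued immediately instead
of going back to poll.

#include <liburing.h>
#include <stdbool.h>

static bool sock_has_more_data(const struct io_uring_cqe *cqe)
{
    /* data was received and the socket still wasn't drained */
    return cqe->res > 0 && (cqe->flags & IORING_CQE_F_SOCK_NONEMPTY);
}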