When the allocation process is scheduled back and the mapped hw queue is
changed, fake one extra wake up on previous queue for compensating wake
up miss, so other allocations on the previous queue won't be starved.
This patch fixes one request allocation hang issue, which can be
triggered easily in case of very low nr_request.
The race is as follows:
1) 2 hw queues, nr_requests are 2, and wake_batch is one
2) there are 3 waiters on hw queue 0
3) two in-flight requests in hw queue 0 are completed, and only two
waiters of 3 are waken up because of wake_batch, but both the two
waiters can be scheduled to another CPU and cause to switch to hw
queue 1
4) then the 3rd waiter will wait for ever, since no in-flight request
is in hw queue 0 any more.
5) this patch fixes it by the fake wakeup when waiter is scheduled to
another hw queue
Cc: <stable@vger.kernel.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Modified commit message to make it clearer, and make it apply on
top of the 4.18 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid that complaints similar to the following appear in the kernel log
if the number of zones is sufficiently large:
fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
Call Trace:
dump_stack+0x63/0x88
warn_alloc+0xf5/0x190
__alloc_pages_slowpath+0x8f0/0xb0d
__alloc_pages_nodemask+0x242/0x260
alloc_pages_current+0x6a/0xb0
kmalloc_order+0x18/0x50
kmalloc_order_trace+0x26/0xb0
__kmalloc+0x20e/0x220
blkdev_report_zones_ioctl+0xa5/0x1a0
blkdev_ioctl+0x1ba/0x930
block_ioctl+0x41/0x50
do_vfs_ioctl+0xaa/0x610
SyS_ioctl+0x79/0x90
do_syscall_64+0x79/0x1b0
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When dispatch_rq_from_ctx is called, in the vast majority of cases
the ctx->rq_list is not empty.
Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the number of hardware queues is changed, the drivers will call
blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
the ctx mappings, but the current code doesn't clear the
->dispatch_from hint. This can result in dispatch_from pointing to
a ctx that isn't mapped to the hctx anymore.
Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Moved the placement of the clearing to where we clear other items
pertaining to the existing mapping, added Fixes line, and reworded
the commit message.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can use blk_mq_sched_insert_request() even if we don't have
an IO scheduler attached, since that case will end up being
exactly the same as what blk_mq_queue_io() was doing now.
Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.
Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since a bio can point to userspace pages (e.g. direct IO), this is
generally necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Found a bug (with ASAN) where we were passing a bio to bio_copy_data()
with bi_next not NULL, when it should have been - a driver had left
bi_next set to something after calling bio_endio().
Since the normal case is only copying single bios, split out
bio_list_copy_data() to avoid more bugs like this in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add versions that take bvec_iter args instead of using bio->bi_iter - to
be used by bcachefs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Minor optimization - remove a pointer indirection when using fs_bio_set.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similarly to mempool_init()/mempool_exit(), take a pointer indirection
out of allocation/freeing by allowing biosets to be embedded in other
structs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Minor performance improvement by getting rid of pointer indirections
from allocation/freeing fastpaths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Same numerical value (for now at least), but a much better documentation
of intent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We just can't do I/O when doing block layer requests allocations,
so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_old_get_request already has it at hand, and in blk_queue_bio, which
is the fast path, it is constant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't expect the async depth to be smaller than the wake batch
count for sbitmap, but just in case, inform sbitmap of what shallow
depth kyber may use.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfqd->sb_shift was attempted used as a cache for the sbitmap queue
shift, but we don't need it, as it never changes. Kill it with fire.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It doesn't change, so don't put it in the per-IO hot path.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reserved tags are used for error handling, we don't need to
care about them for regular IO. The core won't call us for these
anyway.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When invoked for an I/O request rq, the prepare_request hook of bfq
increments reference counters in the destination bfq_queue for rq. In
this respect, after this hook has been invoked, rq may still be
transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this tranformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in its turn cause incorrect scheduling and hangs. It
certainly causes memory leaks, by making it impossible for bfq to free
the involved bfq_queue.
On the bright side, no transformation can still happen for rq after rq
has been inserted into bfq, or merged with another, already inserted,
request. Exploiting this fact, this commit addresses the above issue
by delaying the preparation of an I/O request to when the request is
inserted or merged.
This change also gives a performance bonus: a lock-contention point
gets removed. To prepare a request, bfq needs to hold its scheduler
lock. After postponing request preparation to insertion or merging, no
lock needs to be grabbed any longer in the prepare_request hook, while
the lock already taken to perform insertion or merging is used to
preparare the request as well.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want this next to blk_account_io_done() for the next change so that
we can call ktime_get() only once for both.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cfq and bfq have some internal fields that use sched_clock() which can
trivially use ktime_get_ns() instead. Their timestamp fields in struct
request can also use ktime_get_ns(), which resolves the 8 year old
comment added by commit 28f4197e5d ("block: disable preemption before
using sched_clock()").
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct blk_issue_stat squashes three things into one u64:
- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling
It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct blk_issue_stat is going away, and bio->bi_issue_stat doesn't even
use the blk-stats interface, so we can provide a separate implementation
specific for bios. The helpers work the same way as the blk-stats
helpers.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A few helpers are only used from blk-wbt.c, so move them there, and put
wbt_track() behind the CONFIG_BLK_WBT typedef. This is in preparation
for changing how the wbt flags are tracked.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Throttle discards like we would any background write. Discards should
be background activity, so if they are impacting foreground IO, then
we will throttle them down.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for having more write queues, in which
case we would have needed to pass in more information than just
a simple 'is_kswapd' boolean.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently special case WRITE and FLUSH, but we should really
just include any command with the write bit set. This ensures
that we account DISCARD.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't build discards bigger than what the user asked for, if the
user decided to limit the size by writing to 'discard_max_bytes'.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 9c40cef2b7 ("sched: Move blk_schedule_flush_plug() out of
__schedule()") moved the blk_schedule_flush_plug() call out of the
interrupt/preempt disabled region in the scheduler. This allows to replace
local_irq_save/restore(flags) by local_irq_disable/enable() in
blk_flush_plug_list().
But it makes more sense to disable interrupts explicitly when the request
queue is locked end reenable them when the request to is unlocked. This
shortens the interrupt disabled section which is important when the plug
list contains requests for more than one queue. The comment which claims
that disabling interrupts around the loop is misleading as the called
functions can reenable interrupts unconditionally anyway and obfuscates the
scope badly:
local_irq_save(flags);
spin_lock(q->queue_lock);
...
queue_unplugged(q...);
scsi_request_fn();
spin_unlock_irq(q->queue_lock);
-------------------^^^ ????
spin_lock_irq(q->queue_lock);
spin_unlock(q->queue_lock);
local_irq_restore(flags);
Aside of that the detached interrupt disabling is a constant pain for
PREEMPT_RT as it requires patching and special casing when RT is enabled
while with the spin_*_irq() variants this happens automatically.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 2fff8a924d ("block: Check locking assumptions at runtime") added a
lockdep_assert_held(q->queue_lock) which makes the WARN_ON() redundant
because lockdep will detect and warn about context violations.
The unconditional WARN_ON() does not provide real additional value, so it
can be removed.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bounce_copy_vec() disables interrupts around kmap_atomic(). This is a
leftover from the old kmap_atomic() implementation which relied on fixed
mapping slots, so the caller had to make sure that the same slot could not
be reused from an interrupting context.
kmap_atomic() was changed to dynamic slots long ago and commit 1ec9c5ddc1
("include/linux/highmem.h: remove the second argument of k[un]map_atomic()")
removed the slot assignements, but the callers were not checked for now
redundant interrupt disabling.
Remove the conditional interrupt disable.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.
Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the legacy block case, we increment the counter right after we
allocate the request, not when the driver handles it. In both the legacy
and blk-mq cases, part_inc_in_flight() is called from
blk_account_io_start() right after we've allocated the request. blk-mq
only considers requests started requests as inflight, but this is
inconsistent with the legacy definition and the intention in the code.
This removes the started condition and instead counts all allocated
requests.
Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 37c7c6c76d.
Turns out some drivers(most are FC drivers) may not use managed
IRQ affinity, and has their customized .map_queues meantime, so
still keep this code for avoiding regression.
Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As it came up in discussion on the mailing list that the semantic
meaning of 'blk_mq_ctx' and 'blk_mq_hw_ctx' isn't completely
obvious to everyone, let's add some minimal kerneldoc for a
starter.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The initializing of q->root_blkg is currently outside of queue lock
and rcu, so the blkg may be destroied before the initializing, which
may cause dangling/null references. On the other side, the destroys
of blkg are protected by queue lock or rcu. Put the initializing
inside the queue lock and rcu to make it safer.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
CC: Tejun Heo <tj@kernel.org>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The comment before blkg_create() in blkcg_init_queue() was moved
from blkcg_activate_policy() by commit ec13b1d6f0, but
it does not suit for the new context.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
CC: Tejun Heo <tj@kernel.org>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As described in the comment of blkcg_activate_policy(),
*Update of each blkg is protected by both queue and blkcg locks so
that holding either lock and testing blkcg_policy_enabled() is
always enough for dereferencing policy data.*
with queue lock held, there is no need to hold blkcg lock in
blkcg_deactivate_policy(). Similar case is in
blkcg_activate_policy(), which has removed holding of blkcg lock in
commit 4c55f4f9ad.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
CC: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even if we don't have an IO context attached to a request, we still
need to clear the priv[0..1] pointers, as they could be pointing
to previously used bic/bfqq structures. If we don't do so, we'll
either corrupt memory on dispatching a request, or cause an
imbalance in counters.
Inspired by a fix from Kees.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Fixes: aee69d78de ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rq->gstate and rq->aborted_gstate both are zero before rqs are
allocated. If we have a small timeout, when the timer fires,
there could be rqs that are never allocated, and also there could
be rq that has been allocated but not initialized and started. At
the moment, the rq->gstate and rq->aborted_gstate both are 0, thus
the blk_mq_terminate_expired will identify the rq is timed out and
invoke .timeout early.
For scsi, this will cause scsi_times_out to be invoked before the
scsi_cmnd is not initialized, scsi_cmnd->device is still NULL at
the moment, then we will get crash.
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
The PREEMPT_ONLY flag was introduced later in commit 3a0a529971
("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.
So that commit exposed the bug as a regression in v4.15. A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed. Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]
[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979
Without this fix, I get an IO error in this test:
# dd if=/dev/sda of=/dev/null iflag=direct & \
while killall -SIGUSR1 dd; do sleep 0.1; done & \
echo mem > /sys/power/state ; \
sleep 5; killall dd # stop after 5 seconds
The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>