blk_mq_run_dispatch_ops() is defined as one macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add one local variable
for holding request queue.
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c92 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.
Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It isn't necessary to call blk_mq_run_dispatch_ops() once for issuing
single request directly, and enough to do it one time when issuing from
whole list.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have switched to allocate srcu into request queue, so it is fine
to pass request queue to blk_mq_run_dispatch_ops().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch
critical area. However, this srcu instance stays at the end of hctx, and
it often takes standalone cacheline, often cold.
Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable which is allocated from heap instead of
being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It
doesn't matter if srcu structure stays in hctx or request queue.
So switch to per-request-queue srcu for protecting dispatch, and this
way simplifies quiesce a lot, not mention quiesce is always done on the
request queue wide.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove hctx_lock and hctx_unlock, and add one helper of
blk_mq_run_dispatch_ops() to run code block defined in dispatch_ops
with rcu/srcu read held.
Compared with hctx_lock()/hctx_unlock():
1) remove 2 branch to 1, so we just need to check
(hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops
2) srcu_idx needn't to be touched in case of non-blocking
3) might_sleep_if() can be moved to the blocking branch
Also put the added blk_mq_run_dispatch_ops() in private header, so that
the following patch can use it out of blk-mq.c.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.
This borrows that same implementation, which in turn is based on the
mm implementation from Linus.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just
set it up ourselves and call our own handler. This avoids an indirect call
for O_DIRECT.
Fall back to filemap_read() if we fail.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cgroup.h (therefore swap.h, therefore half of the universe)
includes bpf.h which in turn includes module.h and slab.h.
Since we're about to get rid of that dependency we need
to clean things up.
v2: drop the cpu.h include from cacheinfo.h, it's not necessary
and it makes riscv sensitive to ordering of include files.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Krzysztof Wilczyński <kw@linux.com>
Acked-by: Peter Chen <peter.chen@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20211120035253.72074-1-kuba@kernel.org/ # v1
Link: https://lore.kernel.org/all/20211120165528.197359-1-kuba@kernel.org/ # cacheinfo discussion
Link: https://lore.kernel.org/bpf/20211202203400.1208663-1-kuba@kernel.org
When we attempt to merge off the cached request path, we return NULL
if successful. This makes the caller believe that it's should allocate
a new request, and hence we end up with the bio both merged and associated
with a new request. This, predictably, leads to all sorts of crashes.
Pass in a pointer to the bio pointer, and clear it for the merge case.
Then the caller knows that the bio is already queued, and no new requests
need to get allocated.
Fixes: 5b13bc8a3f ("blk-mq: cleanup request allocation")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Expected case is returning a request, just check for success and return
the request rather than having an error label.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Without checking q->poll_stat in queue_poll_stat_show(), kernel panic
may be caused if q->poll_stat isn't allocated.
Fixes: 48b5c1fbcd ("block: only allocate poll_stats if there's a user of them")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211202090716.3292244-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the gendisk aregument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now. Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just use the disk attached to the request_queue instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a 1:1 relationship between request_queues and gendisks now, so
no need for these extra checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The variable ret is being initialized with a value that is never
read, it is being updated later on. The assignment is redundant and
can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20211126230652.1175636-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Grab a reference to the newly allocated or existing io_context in
create_task_io_context and return it. This simplifies the callers and
removes the need for double lookups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In __copy_io we know that the newly allocate task_struct does not have
an I/O context yet and is not exiting. So just allocate the I/O context
struct and install it directly. There is no need to lock the task
either as it is just being created.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After the prepare side has been moved to the only I/O scheduler that
cares, do the same for the cleanup and the NULL initialization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static. Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 4896c4e64ba5d5d5acdbcf68c5910dd4f6d8fa62.
The helper is not needed any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio->bi_opf isn't finalized before checking the bio, so use it after
submit_bio_checks() returns.
Fixes: 5b13bc8a3f ("blk-mq: cleanup request allocation")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 7cc4ffc555 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.
Consider for example the following job file:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200
[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200
Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.
Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
Fixes: 7cc4ffc555 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of having helper formating bfqq pid, provide a helper to
generate full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.
This significantly reduces false waker detection for workload like:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
fsync=200
[fastwriter]
numjobs=1
fsync=200
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.
Check how many requests submitting cgroup has allocated in
bfq_limit_depth() and if it consumes more requests than what would
correspond to its weight limit available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we lookup ICQ only after the request is allocated. However BFQ
will want to decide how many scheduler tags it allows a given bfq queue
(effectively a process) to consume based on cgroup weight. So provide a
function blk_mq_sched_get_icq() so that BFQ can lookup ICQ earlier.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is trivial and is only used in one place. Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor the request alloction so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.
There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124062856.1444266-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Open code elevator_exit in it's only caller, and rename __elevator_exit to
elevator_exit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.
Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.
Kill the queue flag for this while at it, since we can just check the
assigned pointer now.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to write to the bio if:
1) No ioprio value has ever been assigned to the blkcg
2) We wouldn't anyway, depending on bio and blkcg IO priority
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_submit_bio has two different plug cases, one that uses full
plugging and a limited plugging one.
The limited plugging case is only used for a corner case that does
not matter in real life:
- no ->commit_rqs (so not NVMe)
- no shared tags (so not SCSI)
- not rotational (so no old disk or floppy driver)
- must have multiple queues (so no eMMC)
Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hidden gendisks can't be opened using blkdev_get_*, so we can't really
reach any of the partition scanning paths or partitioning ioctls except
for the initial partition scan from add_disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All modern drivers can support extra partitions using the extended
dev_t. In fact except for the ioctl method drivers never even see
partitions in normal operation.
So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This flag is not set directly anywhere and only inherited from
GENHD_FL_HIDDEN. Just check for GENHD_FL_HIDDEN instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The GENHD_FL_NO_PART_SCAN controls more than just partitions canning,
so rename it to GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_max_parts never returns 0 given that ->minors for devices not using
the extended dev_t must be non-zero, and disk_max_parts always returns
DISK_MAX_PARTS for the latter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE is all about the event reporting
mechanism, so move it to the event_flags field.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The flag to indicate an unlocked native capacity is dynamic state,
not a driver capability flag, so move it to disk->state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All request based code is in the blk-mq files now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is only used by the request completion path. Factor out
a blk_status_to_str to keep blk_errors private in blk-core.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_dump_rq_flags deals with a request, so move it to blk-mq.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are only used for request based I/O, so move them where they are
used.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep all the request based code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_init deals with a request structure, so move it to blk-mq.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep all the request based code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move blk_mq_flush_plug_list and blk_mq_plug_issue_direct down in blk-mq.c
to prepare for marking blk_mq_request_issue_directly static without the
need of a forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All this code is tightly coupled to the blk-mq core, so move it
there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-4-hch@lst.de
[axboe: remove doc generation for blk-exec.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is trivial, and flush_dcache_page is always defined, so
just open code it in the 2.5 callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_err_bytes is only used by the scsi midlayer, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to call rq_qos_done() regardless of whether or not we're freeing
the request or not, as the reference count doesn't cover the IO completion
tracking.
Fixes: f794f3351f ("block: add support for blk_mq_end_request_batch()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The build warning:
block/blk-core.c:968: warning: Function parameter or member 'iob'
not described in 'bio_poll'.
Fixes: 5a72e899ce ("block: add a struct io_comp_batch argument to fops->iopoll()")
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk->fops->owner is grabbed in blkdev_get_no_open() after the disk
kobject refcount is increased. This way can't make sure that
disk->fops->owner is still alive since del_gendisk() still can move
on if the kobject refcount of disk is grabbed by open() and
disk->fops->open() isn't called yet.
Fixes the issue by moving try_module_get() into blkdev_get_by_dev()
with ->open_mutex() held, then we can drain the in-progress open()
in del_gendisk(). Meantime new open() won't succeed because disk
becomes not alive.
This way is reasonable because blkdev_get_no_open() needn't to touch
disk->fops or defined callbacks.
Cc: Christoph Hellwig <hch@lst.de>
Cc: czhong@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111020343.316126-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We never insert flush request into scheduler queue before.
Recently commit d92ca9d834 ("blk-mq: don't handle non-flush requests in
blk_insert_flush") tries to handle FUA data request as normal request.
This way has caused warning[1] in mq-deadline dd_exit_sched() or io hang in
case of kyber since RQF_ELVPRIV isn't set for flush request, then
->finish_request won't be called.
Fix the issue by inserting FUA data request with blk_mq_request_bypass_insert()
when the device supports FUA, just like what we did before.
[1] https://lore.kernel.org/linux-block/CAHj4cs-_vkTW=dAzbZYGxpEWSpzpcmaNeY1R=vH311+9vMUSdg@mail.gmail.com/
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: d92ca9d834 ("blk-mq: don't handle non-flush requests in blk_insert_flush")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211118153041.2163228-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If blk_queue_enter() failed due to queue is dying, the
blkdev_put_no_open() is needed because blkcg_conf_open_bdev() succeeded.
Fixes: 0c9d338c84 ("blk-cgroup: synchronize blkg creation against policy deactivation")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211102020705.2321858-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_init_mq() is only called before adding disk, when there isn't
any FS I/O, only passthrough requests can be queued, so freezing queue
plus canceling dispatch work is enough to drain any dispatch activities,
then we can avoid synchronize_srcu() in blk_mq_quiesce_queue().
Long boot latency issue can be fixed in case of lots of disks added
during booting.
Fixes: 737eb78e82 ("block: Delay default elevator initialization")
Reported-by: yangerkun <yangerkun@huawei.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211117115502.1600950-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we fail the submission queue checks, we don't put the queue afterwards.
This can cause various issues like stalls on scheduler switch or failure
to remove the device, or like in the original bug report, timeout waiting
for the device on reboot/restart.
While in there, fix a few whitespace discrepancies in the surrounding
code.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215039
Fixes: b637108a40 ("blk-mq: fix filesystem I/O request allocation")
Reported-and-tested-by: Stephen Smith <stephenmsmith@blueyonder.co.uk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
submit_bio_checks() may update bio->bi_opf, so we have to initialize
blk_mq_alloc_data.cmd_flags with bio->bi_opf after submit_bio_checks()
returns when allocating new request.
In case of using cached request, fallback to allocate new request if
cached rq isn't compatible with the incoming bio, otherwise change
rq->cmd_flags with incoming bio->bi_opf.
Fixes: 900e080752 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
KASAN reports a use-after-free report when doing block test:
==================================================================
[10050.967049] BUG: KASAN: use-after-free in
submit_bio_checks+0x1539/0x1550
[10050.977638] Call Trace:
[10050.978190] dump_stack+0x9b/0xce
[10050.979674] print_address_description.constprop.6+0x3e/0x60
[10050.983510] kasan_report.cold.9+0x22/0x3a
[10050.986089] submit_bio_checks+0x1539/0x1550
[10050.989576] submit_bio_noacct+0x83/0xc80
[10050.993714] submit_bio+0xa7/0x330
[10050.994435] mpage_readahead+0x380/0x500
[10050.998009] read_pages+0x1c1/0xbf0
[10051.002057] page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413] do_page_cache_ra+0xda/0x110
[10051.008207] force_page_cache_ra+0x23d/0x3d0
[10051.009087] page_cache_sync_ra+0xca/0x300
[10051.009970] generic_file_buffered_read+0xbea/0x2130
[10051.012685] generic_file_read_iter+0x315/0x490
[10051.014472] blkdev_read_iter+0x113/0x1b0
[10051.015300] aio_read+0x2ad/0x450
[10051.023786] io_submit_one+0xc8e/0x1d60
[10051.029855] __se_sys_io_submit+0x125/0x350
[10051.033442] do_syscall_64+0x2d/0x40
[10051.034156] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10051.048733] Allocated by task 18598:
[10051.049482] kasan_save_stack+0x19/0x40
[10051.050263] __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230] kmem_cache_alloc+0x146/0x440
[10051.052060] mempool_alloc+0x125/0x2f0
[10051.052818] bio_alloc_bioset+0x353/0x590
[10051.053658] mpage_alloc+0x3b/0x240
[10051.054382] do_mpage_readpage+0xddf/0x1ef0
[10051.055250] mpage_readahead+0x264/0x500
[10051.056060] read_pages+0x1c1/0xbf0
[10051.056758] page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702] do_page_cache_ra+0xda/0x110
[10051.058511] force_page_cache_ra+0x23d/0x3d0
[10051.059373] page_cache_sync_ra+0xca/0x300
[10051.060198] generic_file_buffered_read+0xbea/0x2130
[10051.061195] generic_file_read_iter+0x315/0x490
[10051.062189] blkdev_read_iter+0x113/0x1b0
[10051.063015] aio_read+0x2ad/0x450
[10051.063686] io_submit_one+0xc8e/0x1d60
[10051.064467] __se_sys_io_submit+0x125/0x350
[10051.065318] do_syscall_64+0x2d/0x40
[10051.066082] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10051.067455] Freed by task 13307:
[10051.068136] kasan_save_stack+0x19/0x40
[10051.068931] kasan_set_track+0x1c/0x30
[10051.069726] kasan_set_free_info+0x1b/0x30
[10051.070621] __kasan_slab_free+0x111/0x160
[10051.071480] kmem_cache_free+0x94/0x460
[10051.072256] mempool_free+0xd6/0x320
[10051.072985] bio_free+0xe0/0x130
[10051.073630] bio_put+0xab/0xe0
[10051.074252] bio_endio+0x3a6/0x5d0
[10051.074984] blk_update_request+0x590/0x1370
[10051.075870] scsi_end_request+0x7d/0x400
[10051.076667] scsi_io_completion+0x1aa/0xe50
[10051.077503] scsi_softirq_done+0x11b/0x240
[10051.078344] blk_mq_complete_request+0xd4/0x120
[10051.079275] scsi_mq_done+0xf0/0x200
[10051.080036] virtscsi_vq_done+0xbc/0x150
[10051.080850] vring_interrupt+0x179/0x390
[10051.081650] __handle_irq_event_percpu+0xf7/0x490
[10051.082626] handle_irq_event_percpu+0x7b/0x160
[10051.083527] handle_irq_event+0xcc/0x170
[10051.084297] handle_edge_irq+0x215/0xb20
[10051.085122] asm_call_irq_on_stack+0xf/0x20
[10051.085986] common_interrupt+0xae/0x120
[10051.086830] asm_common_interrupt+0x1e/0x40
==================================================================
Bio will be checked at beginning of submit_bio_noacct(). If bio needs
to be throttled, it will start the timer and stop submit bio directly.
Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.
But in the current process, if bio is throttled, it will still set bio
issue->value by blkcg_bio_issue_init(). This is redundant and may cause
the above use-after-free.
CPU0 CPU1
submit_bio
submit_bio_noacct
submit_bio_checks
blk_throtl_bio()
<=mod_timer(&sq->pending_timer
blk_throtl_dispatch_work_fn
submit_bio_noacct() <= bio have
throttle tag, will throw directly
and bio issue->value will be set
here
bio_endio()
bio_put()
bio_free() <= free this bio
blkcg_bio_issue_init(bio)
<= bio has been freed and
will lead to UAF
return BLK_QC_T_NONE
Fix this by remove extra blkcg_bio_issue_init.
Fixes: e439bedf6b (blkcg: consolidate bio_issue_init() to be a part of core)
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When BLKRESETZONE ioctl and data read race, the data read leaves stale
page cache. The commit e511350590 ("block: Discard page cache of zone
reset target range") added page cache truncation to avoid stale page
cache after the ioctl. However, the stale page cache still can be read
during the reset zone operation for the ioctl. To avoid the stale page
cache completely, hold invalidate_lock of the block device file mapping.
Fixes: e511350590 ("block: Discard page cache of zone reset target range")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211111085238.942492-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is very annoying to have two block layer functions which share same
name, so rename blk_attempt_bio_merge in blk-mq.c as
blk_mq_attempt_bio_merge.
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_sched_bio_merge is only called from blk-mq.c:blk_attempt_bio_merge(),
which is called when queue usage counter is grabbed already:
1) blk_mq_get_new_requests()
2) blk_mq_get_request()
- cached request in current plug owns one queue usage counter
So don't grab ->q_usage_counter in blk_mq_sched_bio_merge(), and more
importantly this nest way causes hang in blk_mq_freeze_queue_wait().
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The naming got changed as part of a revision of the patchset, but the
kerneldoc apparently never got updated. Fix it.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: a2247f19ee ("block: Add independent access ranges support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have done a spring cleaning on all drivers and added
error checking / handling, let's keep it that way and ensure
no new drivers fail to stick with it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211110002949.999380-1-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kernel test robot reports that we now trigger some sparse warnings:
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
which is due to ->rq_flags being an unsigned int, rather than the
stronger type req_flags_t enum.
Change the type to req_flags_t to silence this warning.
Fixes: 56f8da642b ("block: add rq_flags to struct blk_mq_alloc_data")
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.
Fixes: 22dd6d3566 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.
This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.
Fixes: 351499a172 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqOcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr3yEADD9Cx8oNk3KzWV3c3JlIR4JQtvpczS3dho
KkGU0D5fOh1sViXbLBNr6VxypcEIKQoHxDQQ6qid1kOu/B3mCNM1duLsVjyj3Qa0
7nbm2dVUsD/EVDuXedRmMvcfCUx6Z23DbpI182wXtIPaCsEEmsANzHnZNg38OV44
25SYG0QUvb9ViSz1Y1GORu0ttEJNF2GhZfiBpb0WveRnY7eTSL/PnHNDzHsSeFv4
zD0W205g7jKbt0+57kgNElTz7DbdM3p8XVex+aXPlFaHz2qx4ZoJJIsaMv/P8tT5
14b50cB41xnPvlGTvqr1WfZZfJocDNq2rG+fh6N5D1sO86ogWpj7psiiADfa0pb6
ZWoJqhk3BvEUMPQ5N/BJ/8j3FWGIYWtKQf4QcyxrJYpqDwtwbBfMlzKkc7JMPFYk
JAi6uq1uF5SbA4x99G90tK85LvxsbkseyIYXgBJ/GIyW5doIPkD9TPDEzJMCdHOe
laynHS5PMHzuhPLuEDDn9sTVXpZWAMBnoy4j1L4wGmBjiogYWLTSJVobODzCAqHY
1Va2oP6SXfCdVRkCysFbcrdsjJuoIWlMKrdE40tNvkmU0v7sEX0Zd+GLHiaWdIZa
fgxC9fmZtDDOowCp+Iw0VaAqPeeptmyUrof06ZktJleOAscX7kSwbxPdmr1FM0jy
dbnLDyaq/A==
=QaFI
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block
Pull more block driver updates from Jens Axboe:
- Last series adding error handling support for add_disk() in drivers.
After this one, and once the SCSI side has been merged, we can
finally annotate add_disk() as must_check. (Luis)
- bcache fixes (Coly)
- zram fixes (Ming)
- ataflop locking fix (Tetsuo)
- nbd fixes (Ye, Yu)
- MD merge via Song
- Cleanup (Yang)
- sysfs fix (Guoqing)
- Misc fixes (Geert, Wu, luo)
* tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits)
bcache: Revert "bcache: use bvec_virt"
ataflop: Add missing semicolon to return statement
floppy: address add_disk() error handling on probe
ataflop: address add_disk() error handling on probe
block: update __register_blkdev() probe documentation
ataflop: remove ataflop_probe_lock mutex
mtd/ubi/block: add error handling support for add_disk()
block/sunvdc: add error handling support for add_disk()
z2ram: add error handling support for add_disk()
nvdimm/pmem: use add_disk() error handling
nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
nvdimm/blk: add error handling support for add_disk()
nvdimm/blk: avoid calling del_gendisk() on early failures
nvdimm/btt: add error handling support for add_disk()
nvdimm/btt: use goto error labels on btt_blk_init()
loop: Remove duplicate assignments
drbd: Fix double free problem in drbd_create_device
nvdimm/btt: do not call del_gendisk() if not needed
bcache: fix use-after-free problem in bcache_device_free()
zram: replace fsync_bdev with sync_blockdev
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqAcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpojrD/4yA+GgV+jWeIepYWvU81TQFpt9AJmzWbrY
uryj4dy7EdMjun+JkAP8k4qreqvTZRsJMkr9dhmS4qaM8/Vt8K/RU/0n/lxNVmqc
1//ZaTS6DURVAc52GHIXD3q4cv8pHofTZZlrj1Hgz35shlOayStGJtktH5f8uQl4
5Yxjh+HKr15Chym+fKlbR6T7BgVxxNyhT9q89BgUwMAJX+1KRVtwtkyVK5IbObFy
zOeiC+n9niQ6iJHcLoqb7LjfBOs/VjdNOQYGSCAnrBxuQ8GnEP2xDw2nvFlOPE12
5tWEwTgAX7381ilbL6VvNTlTafIs/Axt8mI0cY/OMW7ApiHwO3rXjQSqA4yrnKCJ
h6M1QavqThd2DtMnOi0U5wwgtD2UjS+CMpK5XFxeIyl6GqTgZcaWm3VqRnG68KZD
r5+o99GKWCHy0cckxq2WiWJouReeNZ9u9R6HNDw0Vb8UNyWgBR+v2MkX+SHS/c85
2gXm10hwBH7BFnC4X8ceiuT/bm7xm9S6D/3LCVitlUTBRfqobsQEQjSciPeoOtL0
rRSTKob7jtokiB2q01wx3q1jnUMpxE1fqJkpLjUvebTzw+a+xfPwy0nNTGq0XXIv
WMVRRpSWCZm04Ru0q/K8cj0GOyur5x+ilefZ1V+/sRU5dVmGuJgbJUxei1HPC6eV
z9Rn0aFv4g==
=1GPi
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Set of fixes for the batched tag allocation (Ming, me)
- add_disk() error handling fix (Luis)
- Nested queue quiesce fixes (Ming)
- Shared tags init error handling fix (Ye)
- Misc cleanups (Jean, Ming, me)
* tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block:
nvme: wait until quiesce is done
scsi: make sure that request queue queiesce and unquiesce balanced
scsi: avoid to quiesce sdev->request_queue two times
blk-mq: add one API for waiting until quiesce is done
blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
block: fix device_add_disk() kobject_create_and_add() error handling
block: ensure cached plug request matches the current queue
block: move queue enter logic into blk_mq_submit_bio()
block: make bio_queue_enter() fast-path available inline
block: split request allocation components into helpers
block: have plug stored requests hold references to the queue
blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
blk-mq: add RQF_ELV debug entry
blk-mq: only try to run plug merge if request has same queue with incoming bio
block: move RQF_ELV setting into allocators
dm: don't stop request queue after the dm device is suspended
block: replace always false argument with 'false'
block: assign correct tag before doing prefetch of request
blk-mq: fix redundant check of !e expression
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKp6YQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphBaD/9Wm1lUI5qPB9tIQERWNsx4phr2Xb5WV3rg
6ZFPDnjw0EEdfJmHLMKT5w/lguARtTKJAPeMmvX38RhB8U+44jeok0Bc+mJajjBD
7OkE2KHQ/GnnZxzHQVyoYYo390zLOwNqALDLE55bfQsxKjJXXM03Stwcubzg9MZ+
JRR4mFISU+7qhYu6Qx1RAvEqrGqWzstgrkbsnf/MJ7dKAy3cyqOiXKc/gDKGipIr
skSrwbsB9AE6zW5UMdWN0r/wz1fm2bfJOR1meYgUj83exi1EoWDt52oceYWPky2D
ievHRWUXBkcGgDNvKO6ZO99spJqs/MUy0jhzAW46qYDYUfMcbzthewWzX5mxQBDx
YtDgH949DLVB/jPg0eoTeB8j1hcfO/gU6jdR9bJl4ZQKE/9UB9RMfusXHklmcSTL
nZY8SfWp2ecu6HkGUPXJf051mEwCHCaROIslVeNeZBAbh/duVz3VG9EIO+seLOFR
JJaqsB3xVk1HhqUYoQ+IiAHAkROa73jSv4hsvDu8BX923pv4JWAbTslRFU8lCG1w
heJZ/QiP20QtEcEVCg14wEMiH5tBMvM7OhslQbhOSs+wcm8zyShjRhoa2idq/aTw
k289DEfI52k8SIyI0+rClbWR+hEQ9y2K5opur8gR1SuIv9rbM+UzBc5hDbfT6t3W
4xPpYESkhg==
=bjgx
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/bdev-size-2021-11-09' of git://git.kernel.dk/linux-block
Pull more bdev size updates from Jens Axboe:
"Two followup changes for the bdev-size series from this merge window:
- Add loff_t cast to bdev_nr_bytes() (Christoph)
- Use bdev_nr_bytes() consistently for the block parts at least (me)"
* tag 'for-5.16/bdev-size-2021-11-09' of git://git.kernel.dk/linux-block:
block: use new bdev_nr_bytes() helper for blkdev_{read,write}_iter()
block: add a loff_t cast to bdev_nr_bytes
Some drivers(NVMe, SCSI) need to call quiesce and unquiesce in pair, but it
is hard to switch to this style, so these drivers need one atomic flag for
helping to balance quiesce and unquiesce.
When quiesce is in-progress, the driver still needs to wait until
the quiesce is done, so add API of blk_mq_wait_quiesce_done() for
these drivers.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We got UAF report on v5.10 as follows:
[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800] dump_stack+0x9b/0xce
[ 1446.682916] print_address_description.constprop.6+0x3e/0x60
[ 1446.685999] kasan_report.cold.9+0x22/0x3a
[ 1446.687186] blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785] blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576] __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279] blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967] __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561] __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407] blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593] blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309] blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408] blk_flush_plug_list+0x2c5/0x480
[ 1446.708471] blk_finish_plug+0x55/0xa0
[ 1446.708980] blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236] process_one_work+0x6d4/0xfe0
[ 1446.711778] worker_thread+0x91/0xc80
[ 1446.713400] kthread+0x32d/0x3f0
[ 1446.714362] ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509] kasan_save_stack+0x19/0x40
[ 1446.716026] __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673] blk_mq_init_tags+0x6d/0x330
[ 1446.717207] blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769] __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459] blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050] scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736] virtscsi_probe+0x7bf/0xbd0
[ 1446.720265] virtio_dev_probe+0x402/0x6c0
[ 1446.720808] really_probe+0x276/0xde0
[ 1446.721320] driver_probe_device+0x267/0x3d0
[ 1446.721892] device_driver_attach+0xfe/0x140
[ 1446.722491] __driver_attach+0x13a/0x2c0
[ 1446.723037] bus_for_each_dev+0x146/0x1c0
[ 1446.723603] bus_add_driver+0x3fc/0x680
[ 1446.724145] driver_register+0x1c0/0x400
[ 1446.724693] init+0xa2/0xe8
[ 1446.725091] do_one_initcall+0x9e/0x310
[ 1446.725626] kernel_init_freeable+0xc56/0xcb9
[ 1446.726231] kernel_init+0x11/0x198
[ 1446.726714] ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882] kasan_save_stack+0x19/0x40
[ 1446.728420] kasan_set_track+0x1c/0x30
[ 1446.728943] kasan_set_free_info+0x1b/0x30
[ 1446.729517] __kasan_slab_free+0x111/0x160
[ 1446.730084] kfree+0xb8/0x520
[ 1446.730507] blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206] blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844] blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540] blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155] scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730] scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281] scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916] __scsi_scan_target+0x208/0xc50
[ 1446.735500] scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149] scsi_scan_host_selected+0x25a/0x360
[ 1446.736783] store_scan+0x290/0x2d0
[ 1446.737275] dev_attr_store+0x55/0x80
[ 1446.737782] sysfs_kf_write+0x132/0x190
[ 1446.738313] kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921] new_sync_write+0x40e/0x5c0
[ 1446.739429] vfs_write+0x519/0x720
[ 1446.739877] ksys_write+0xf8/0x1f0
[ 1446.740332] do_syscall_64+0x2d/0x40
[ 1446.740802] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670] which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276] 256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102] ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090] ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065] ^
[ 1446.755589] ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574] ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================
Flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the
same host initializes it's queue successfully. However, if the second
device failed to allocate memory in blk_mq_alloc_and_init_hctx() from
blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(),
__blk_mq_free_map_and_rqs() will be called on error path, and if
'BLK_MQ_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed
while it's still used by the first device.
To fix this issue we move release newly allocated hardware context from
blk_mq_realloc_hw_ctxs to __blk_mq_update_nr_hw_queues. As there is needn't to
release hardware context in blk_mq_init_allocated_queue.
Fixes: 868f2f0b72 ("blk-mq: dynamic h/w context count")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have new helpers for this, use them rather than the slower inode
size reads. This makes the read/write path consistent with most of
the rest of block as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/a72767cd-3c6d-47f7-80f4-aa025a17b2cb@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>