2019-05-01 02:42:43 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2014-05-29 00:15:41 +08:00
|
|
|
/*
|
2016-09-17 22:38:44 +08:00
|
|
|
* Tag allocation using scalable bitmaps. Uses active queue tracking to support
|
|
|
|
* fairer distribution of tags between multiple submitters when a shared tag map
|
|
|
|
* is used.
|
2014-05-29 00:15:41 +08:00
|
|
|
*
|
|
|
|
* Copyright (C) 2013-2014 Jens Axboe
|
|
|
|
*/
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
|
|
|
|
#include <linux/blk-mq.h>
|
2019-07-24 11:48:40 +08:00
|
|
|
#include <linux/delay.h>
|
2013-10-24 16:20:05 +08:00
|
|
|
#include "blk.h"
|
|
|
|
#include "blk-mq.h"
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with possibly swapping out a new sbitmap for the old one if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 20:00:58 +08:00
|
|
|
#include "blk-mq-sched.h"
|
2013-10-24 16:20:05 +08:00
|
|
|
#include "blk-mq-tag.h"
|
|
|
|
|
2014-05-14 05:10:52 +08:00
|
|
|
/*
|
|
|
|
* If a previously inactive queue goes active, bump the active user count.
|
2018-08-09 22:34:17 +08:00
|
|
|
* We need to do this before trying to allocate a driver tag, so that even if
|
|
|
|
* the first attempt to get a tag fails, the other shared-tag users can reserve
|
|
|
|
* budget for it.
|
2014-05-14 05:10:52 +08:00
|
|
|
*/
|
|
|
|
bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
2020-08-19 23:20:27 +08:00
|
|
|
if (blk_mq_is_sbitmap_shared(hctx->flags)) {
|
|
|
|
struct request_queue *q = hctx->queue;
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
|
|
|
|
if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
|
|
|
|
!test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
|
|
|
|
atomic_inc(&set->active_queues_shared_sbitmap);
|
|
|
|
} else {
|
|
|
|
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
|
|
|
|
!test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
|
|
|
|
atomic_inc(&hctx->tags->active_queues);
|
|
|
|
}
|
2014-05-14 05:10:52 +08:00
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2014-12-23 05:04:42 +08:00
|
|
|
* Wake up all tasks potentially sleeping on tags
|
2014-05-14 05:10:52 +08:00
|
|
|
*/
|
2014-12-23 05:04:42 +08:00
|
|
|
void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
|
2014-05-14 05:10:52 +08:00
|
|
|
{
|
2020-08-19 23:20:23 +08:00
|
|
|
sbitmap_queue_wake_all(tags->bitmap_tags);
|
2016-09-17 22:38:44 +08:00
|
|
|
if (include_reserve)
|
2020-08-19 23:20:23 +08:00
|
|
|
sbitmap_queue_wake_all(tags->breserved_tags);
|
2014-05-14 05:10:52 +08:00
|
|
|
}
|
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
/*
|
|
|
|
* If a previously busy queue goes inactive, potential waiters could now
|
|
|
|
* be allowed to queue. Wake them up and check.
|
|
|
|
*/
|
|
|
|
void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
struct blk_mq_tags *tags = hctx->tags;
|
2020-08-19 23:20:27 +08:00
|
|
|
struct request_queue *q = hctx->queue;
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
2014-05-21 01:49:02 +08:00
|
|
|
|
2020-08-19 23:20:27 +08:00
|
|
|
if (blk_mq_is_sbitmap_shared(hctx->flags)) {
|
|
|
|
if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
|
|
|
|
&q->queue_flags))
|
|
|
|
return;
|
|
|
|
atomic_dec(&set->active_queues_shared_sbitmap);
|
|
|
|
} else {
|
|
|
|
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
|
|
|
|
return;
|
|
|
|
atomic_dec(&tags->active_queues);
|
|
|
|
}
|
2014-05-21 01:49:02 +08:00
|
|
|
|
2014-12-23 05:04:42 +08:00
|
|
|
blk_mq_tag_wakeup_all(tags, false);
|
2014-05-21 01:49:02 +08:00
|
|
|
}
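The busy/idle pair above counts a queue as active only on the inactive-to-active transition, and it reads the flag before attempting the atomic test-and-set so that an already-active queue does not write to the shared flag. The following is a minimal user-space sketch of that pattern, assuming C11 atomics in place of the kernel bit operations. The names `demo_tags`, `demo_queue`, `demo_tag_busy` and `demo_tag_idle` are hypothetical and are not part of the kernel source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared tag map and its active-user count. */
struct demo_tags {
	atomic_int active_queues;
};

/* Hypothetical stand-in for one hardware queue sharing that tag map. */
struct demo_queue {
	atomic_bool tag_active;
	struct demo_tags *tags;
};

static void demo_tags_init(struct demo_tags *tags)
{
	atomic_init(&tags->active_queues, 0);
}

static void demo_queue_init(struct demo_queue *q, struct demo_tags *tags)
{
	atomic_init(&q->tag_active, false);
	q->tags = tags;
}

/* Mark the queue active; count it only on the inactive -> active transition. */
static bool demo_tag_busy(struct demo_queue *q)
{
	/* Plain load first, so the common already-active case skips the exchange. */
	if (!atomic_load(&q->tag_active) &&
	    !atomic_exchange(&q->tag_active, true))
		atomic_fetch_add(&q->tags->active_queues, 1);

	return true;
}

/* Mark the queue idle; uncount it only if it was actually active. */
static void demo_tag_idle(struct demo_queue *q)
{
	if (!atomic_exchange(&q->tag_active, false))
		return;

	atomic_fetch_sub(&q->tags->active_queues, 1);
}
```

Calling `demo_tag_busy()` twice on the same queue raises the count once, and a second `demo_tag_idle()` is a no-op. The sketch leaves out the wake-up of sleeping waiters that the kernel function performs after the decrement.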
|
|
|
|
|
2017-01-25 23:11:38 +08:00
|
|
|
static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
|
|
|
|
struct sbitmap_queue *bt)
|
2014-05-09 23:36:49 +08:00
|
|
|
{
|
2020-09-11 18:41:14 +08:00
|
|
|
if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
|
|
|
|
!hctx_may_queue(data->hctx, bt))
|
2020-05-29 21:53:12 +08:00
|
|
|
return BLK_MQ_NO_TAG;
|
2020-06-29 23:08:34 +08:00
|
|
|
|
2017-04-14 15:59:59 +08:00
|
|
|
if (data->shallow_depth)
|
|
|
|
return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
|
|
|
|
else
|
|
|
|
return __sbitmap_queue_get(bt);
|
2014-05-09 23:36:49 +08:00
|
|
|
}
|
|
|
|
|
2017-01-13 23:09:05 +08:00
|
|
|
unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
|
2013-10-24 16:20:05 +08:00
|
|
|
{
|
2017-01-13 23:09:05 +08:00
|
|
|
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
|
|
|
|
struct sbitmap_queue *bt;
|
2016-09-17 22:38:44 +08:00
|
|
|
struct sbq_wait_state *ws;
|
2018-11-30 08:36:41 +08:00
|
|
|
DEFINE_SBQ_WAIT(wait);
|
2017-01-13 23:09:05 +08:00
|
|
|
unsigned int tag_offset;
|
2013-10-24 16:20:05 +08:00
|
|
|
int tag;
|
|
|
|
|
2017-01-13 23:09:05 +08:00
|
|
|
if (data->flags & BLK_MQ_REQ_RESERVED) {
|
|
|
|
if (unlikely(!tags->nr_reserved_tags)) {
|
|
|
|
WARN_ON_ONCE(1);
|
2020-05-29 21:53:11 +08:00
|
|
|
return BLK_MQ_NO_TAG;
|
2017-01-13 23:09:05 +08:00
|
|
|
}
|
2020-08-19 23:20:23 +08:00
|
|
|
bt = tags->breserved_tags;
|
2017-01-13 23:09:05 +08:00
|
|
|
tag_offset = 0;
|
|
|
|
} else {
|
2020-08-19 23:20:23 +08:00
|
|
|
bt = tags->bitmap_tags;
|
2017-01-13 23:09:05 +08:00
|
|
|
tag_offset = tags->nr_reserved_tags;
|
|
|
|
}
|
|
|
|
|
2017-01-25 23:11:38 +08:00
|
|
|
tag = __blk_mq_get_tag(data, bt);
|
2020-05-29 21:53:12 +08:00
|
|
|
if (tag != BLK_MQ_NO_TAG)
|
2017-01-13 23:09:05 +08:00
|
|
|
goto found_tag;
|
2014-05-09 23:36:49 +08:00
|
|
|
|
2015-11-26 16:13:05 +08:00
|
|
|
if (data->flags & BLK_MQ_REQ_NOWAIT)
|
2020-05-29 21:53:11 +08:00
|
|
|
return BLK_MQ_NO_TAG;
|
2014-05-09 23:36:49 +08:00
|
|
|
|
2017-01-13 23:09:05 +08:00
|
|
|
ws = bt_wait_ptr(bt, data->hctx);
|
2014-05-09 23:36:49 +08:00
|
|
|
do {
|
2018-05-25 01:00:39 +08:00
|
|
|
struct sbitmap_queue *bt_prev;
|
|
|
|
|
2014-12-08 23:46:34 +08:00
|
|
|
/*
|
|
|
|
* We're out of tags on this hardware queue, kick any
|
|
|
|
* pending IO submits before going to sleep waiting for
|
2017-01-19 22:39:17 +08:00
|
|
|
* some to complete.
|
2014-12-08 23:46:34 +08:00
|
|
|
*/
|
2017-01-19 22:39:17 +08:00
|
|
|
blk_mq_run_hw_queue(data->hctx, false);
|
2014-12-08 23:46:34 +08:00
|
|
|
|
2014-12-08 23:49:06 +08:00
|
|
|
/*
|
|
|
|
* Retry tag allocation after running the hardware queue,
|
|
|
|
* as running the queue may also have found completions.
|
|
|
|
*/
|
2017-01-25 23:11:38 +08:00
|
|
|
tag = __blk_mq_get_tag(data, bt);
|
2020-05-29 21:53:12 +08:00
|
|
|
if (tag != BLK_MQ_NO_TAG)
|
2014-12-08 23:49:06 +08:00
|
|
|
break;
|
|
|
|
|
2018-11-30 08:36:41 +08:00
|
|
|
sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
|
2017-11-15 01:24:58 +08:00
|
|
|
|
|
|
|
tag = __blk_mq_get_tag(data, bt);
|
2020-05-29 21:53:12 +08:00
|
|
|
if (tag != BLK_MQ_NO_TAG)
|
2017-11-15 01:24:58 +08:00
|
|
|
break;
|
|
|
|
|
2018-05-25 01:00:39 +08:00
|
|
|
bt_prev = bt;
|
2014-05-09 23:36:49 +08:00
|
|
|
io_schedule();
|
2014-06-01 00:43:37 +08:00
|
|
|
|
2018-11-30 08:36:41 +08:00
|
|
|
sbitmap_finish_wait(bt, ws, &wait);
|
|
|
|
|
2014-06-01 00:43:37 +08:00
|
|
|
data->ctx = blk_mq_get_ctx(data->q);
|
2018-10-30 03:11:38 +08:00
|
|
|
data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
|
2019-01-24 18:25:32 +08:00
|
|
|
data->ctx);
|
2017-01-13 23:09:05 +08:00
|
|
|
tags = blk_mq_tags_from_data(data);
|
|
|
|
if (data->flags & BLK_MQ_REQ_RESERVED)
|
2020-08-19 23:20:23 +08:00
|
|
|
bt = tags->breserved_tags;
|
2017-01-13 23:09:05 +08:00
|
|
|
else
|
2020-08-19 23:20:23 +08:00
|
|
|
bt = tags->bitmap_tags;
|
2017-01-13 23:09:05 +08:00
|
|
|
|
2018-05-25 01:00:39 +08:00
|
|
|
/*
|
|
|
|
* If the destination hw queue has changed, issue a fake wake-up on
|
|
|
|
* the previous queue to compensate for the missed wake-up, so
|
|
|
|
* other allocations on the previous queue won't be starved.
|
|
|
|
*/
|
|
|
|
if (bt != bt_prev)
|
|
|
|
sbitmap_queue_wake_up(bt_prev);
|
|
|
|
|
2017-01-13 23:09:05 +08:00
|
|
|
ws = bt_wait_ptr(bt, data->hctx);
|
2014-05-09 23:36:49 +08:00
|
|
|
} while (1);
|
|
|
|
|
2018-11-30 08:36:41 +08:00
|
|
|
sbitmap_finish_wait(bt, ws, &wait);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2017-01-13 23:09:05 +08:00
|
|
|
found_tag:
|
2020-05-29 21:53:15 +08:00
|
|
|
/*
|
|
|
|
* Give up this allocation if the hctx is inactive. The caller will
|
|
|
|
* retry on an active hctx.
|
|
|
|
*/
|
|
|
|
if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
|
|
|
|
blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
|
|
|
|
return BLK_MQ_NO_TAG;
|
|
|
|
}
|
2017-01-13 23:09:05 +08:00
|
|
|
return tag + tag_offset;
|
2013-10-24 16:20:05 +08:00
|
|
|
}
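The allocation path above draws reserved and normal tags from two separate bitmaps and adds `tag_offset` so that both kinds share one tag number space: reserved tags take `[0, nr_reserved_tags)` and normal tags start at `nr_reserved_tags`. The sketch below illustrates only that numbering, assuming a plain 64-bit word per pool instead of an sbitmap and no sleeping or fairness checks. All `demo_*` names are hypothetical, and the release path is an assumption that simply inverts the offset.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NO_TAG (~0U)	/* hypothetical analogue of BLK_MQ_NO_TAG */

/* One pool of up to 64 tags; a set bit means the tag is in use. */
struct demo_pool {
	uint64_t bits;
	unsigned int depth;
};

struct demo_tagset {
	struct demo_pool reserved;	/* tags [0, reserved.depth) */
	struct demo_pool normal;	/* tags [reserved.depth, ...) */
};

/* Return the lowest free index in the pool, or DEMO_NO_TAG if it is full. */
static unsigned int demo_pool_get(struct demo_pool *p)
{
	for (unsigned int i = 0; i < p->depth; i++) {
		if (!(p->bits & (UINT64_C(1) << i))) {
			p->bits |= UINT64_C(1) << i;
			return i;
		}
	}
	return DEMO_NO_TAG;
}

/* Pick the pool and the offset, then translate the pool index into a tag. */
static unsigned int demo_get_tag(struct demo_tagset *s, int want_reserved)
{
	struct demo_pool *p;
	unsigned int offset, idx;

	if (want_reserved) {
		if (!s->reserved.depth)
			return DEMO_NO_TAG;
		p = &s->reserved;
		offset = 0;
	} else {
		p = &s->normal;
		offset = s->reserved.depth;
	}

	idx = demo_pool_get(p);
	return idx == DEMO_NO_TAG ? DEMO_NO_TAG : idx + offset;
}

/* Assumed release path: undo the offset to find the owning pool and bit. */
static void demo_put_tag(struct demo_tagset *s, unsigned int tag)
{
	if (tag >= s->reserved.depth)
		s->normal.bits &= ~(UINT64_C(1) << (tag - s->reserved.depth));
	else
		s->reserved.bits &= ~(UINT64_C(1) << tag);
}
```

With two reserved tags and four normal tags, the first normal allocation returns tag 2, and freeing tag 2 clears bit 0 of the normal pool.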
|
|
|
|
|
2020-02-26 20:10:15 +08:00
|
|
|
void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
|
|
|
|
unsigned int tag)
|
{
	if (!blk_mq_tag_is_reserved(tags, tag)) {
		const int real_tag = tag - tags->nr_reserved_tags;

		BUG_ON(real_tag >= tags->nr_tags);
		sbitmap_queue_clear(tags->bitmap_tags, real_tag, ctx->cpu);
	} else {
		BUG_ON(tag >= tags->nr_reserved_tags);
		sbitmap_queue_clear(tags->breserved_tags, tag, ctx->cpu);
	}
}
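
The tag-to-bit mapping used by blk_mq_put_tag() can be sketched in user space as follows. This is only an illustration under stated assumptions: struct tag_space, tag_is_reserved(), tag_to_bit() and bit_to_tag() are hypothetical names that do not exist in the kernel, and the reserved check is assumed to be "tag < nr_reserved_tags".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct blk_mq_tags fields used here. */
struct tag_space {
	unsigned int nr_tags;		/* total tags, reserved ones included */
	unsigned int nr_reserved_tags;	/* tags [0, nr_reserved_tags) are reserved */
};

/* Assumed to mirror the check behind blk_mq_tag_is_reserved(). */
static bool tag_is_reserved(const struct tag_space *ts, unsigned int tag)
{
	return tag < ts->nr_reserved_tags;
}

/* Bit index inside whichever bitmap (reserved or normal) owns @tag. */
static unsigned int tag_to_bit(const struct tag_space *ts, unsigned int tag)
{
	assert(tag < ts->nr_tags);
	if (tag_is_reserved(ts, tag))
		return tag;
	return tag - ts->nr_reserved_tags;
}

/* Inverse mapping, as the iterators do with "bitnr += nr_reserved_tags". */
static unsigned int bit_to_tag(const struct tag_space *ts, unsigned int bit,
			       bool reserved)
{
	return reserved ? bit : bit + ts->nr_reserved_tags;
}
```

With 8 tags of which 2 are reserved, tag 5 lands on bit 3 of the normal bitmap, and bit 3 of the normal bitmap maps back to tag 5.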

struct bt_iter_data {
	struct blk_mq_hw_ctx *hctx;
	busy_iter_fn *fn;
	void *data;
	bool reserved;
};

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
		unsigned int bitnr)
{
	struct request *rq;
	unsigned long flags;

	spin_lock_irqsave(&tags->lock, flags);
	rq = tags->rqs[bitnr];
	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
		rq = NULL;
	spin_unlock_irqrestore(&tags->lock, flags);
	return rq;
}
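
The "take a reference only if the object is still live" step in blk_mq_find_and_get_req() can be approximated in user space with C11 atomics. This is a rough sketch, not the kernel's refcount_t implementation: struct obj, ref_inc_not_zero() and find_and_get() are hypothetical names, and the spinlock that protects tags->rqs[] in the kernel is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with a reference count and the slot it claims. */
struct obj {
	atomic_uint ref;
	unsigned int tag;
};

/* Increment @ref unless it is already zero; returns true on success. */
static bool ref_inc_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* On failure the CAS reloads @old with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Look up slot @idx and return the object only if it exists, still claims
 * that slot, and a reference could be taken. The caller drops the reference
 * when done.
 */
static struct obj *find_and_get(struct obj **table, unsigned int nr,
				unsigned int idx)
{
	struct obj *o;

	if (idx >= nr)
		return NULL;
	o = table[idx];
	if (!o || o->tag != idx || !ref_inc_not_zero(&o->ref))
		return NULL;
	return o;
}
```

An entry whose count has already dropped to zero, or whose tag no longer matches its slot, is treated the same as an empty slot, which is what lets the iterators skip stale entries safely.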

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	struct bt_iter_data *iter_data = data;
	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
	struct blk_mq_tags *tags = hctx->tags;
	bool reserved = iter_data->reserved;
	struct request *rq;
	bool ret = true;

	if (!reserved)
		bitnr += tags->nr_reserved_tags;
	/*
	 * We can hit rq == NULL here, because the tagging functions
	 * test and set the bit before assigning ->rqs[].
	 */
	rq = blk_mq_find_and_get_req(tags, bitnr);
	if (!rq)
		return true;

	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
	blk_mq_put_rq_ref(rq);
	return ret;
}

/**
 * bt_for_each - iterate over the requests associated with a hardware queue
 * @hctx:	Hardware queue to examine.
 * @bt:		sbitmap to examine. This is either the breserved_tags member
 *		or the bitmap_tags member of struct blk_mq_tags.
 * @fn:		Pointer to the function that will be called for each request
 *		associated with @hctx that has been assigned a driver tag.
 *		@fn will be called as follows: @fn(@hctx, rq, @data, @reserved)
 *		where rq is a pointer to a request. Return true to continue
 *		iterating tags, false to stop.
 * @data:	Will be passed as third argument to @fn.
 * @reserved:	Indicates whether @bt is the breserved_tags member or the
 *		bitmap_tags member of struct blk_mq_tags.
 */
static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
			busy_iter_fn *fn, void *data, bool reserved)
{
	struct bt_iter_data iter_data = {
		.hctx = hctx,
		.fn = fn,
		.data = data,
		.reserved = reserved,
	};

	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
}
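
The callback contract used by bt_iter() and bt_for_each() (return true to keep iterating, false to stop) can be illustrated with a plain bitmap walk. This is a simplified user-space sketch, not the sbitmap implementation: for_each_set_bit_cb(), bit_iter_fn, struct count_ctx and count_until_limit() are hypothetical names.

```c
#include <limits.h>
#include <stdbool.h>

/* Callback type: return true to continue iterating, false to stop. */
typedef bool (*bit_iter_fn)(unsigned int bitnr, void *data);

/* Call @fn for every set bit in @words, lowest bit first. */
static void for_each_set_bit_cb(const unsigned long *words, unsigned int nbits,
				bit_iter_fn fn, void *data)
{
	const unsigned int bpw = sizeof(unsigned long) * CHAR_BIT;
	unsigned int bit;

	for (bit = 0; bit < nbits; bit++) {
		if (!(words[bit / bpw] & (1UL << (bit % bpw))))
			continue;
		if (!fn(bit, data))
			return;
	}
}

/* Example context and callback: count set bits, stop at a limit. */
struct count_ctx {
	unsigned int seen;
	unsigned int limit;
};

static bool count_until_limit(unsigned int bitnr, void *data)
{
	struct count_ctx *c = data;

	(void)bitnr;
	c->seen++;
	return c->seen < c->limit;
}
```

As in bt_for_each(), the per-iteration state travels through an opaque data pointer, so the walker itself needs no knowledge of what the callback does with each bit.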

struct bt_tags_iter_data {
	struct blk_mq_tags *tags;
	busy_tag_iter_fn *fn;
	void *data;
	unsigned int flags;
};

#define BT_TAG_ITER_RESERVED		(1 << 0)
#define BT_TAG_ITER_STARTED		(1 << 1)
#define BT_TAG_ITER_STATIC_RQS		(1 << 2)

static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	struct bt_tags_iter_data *iter_data = data;
	struct blk_mq_tags *tags = iter_data->tags;
	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
	struct request *rq;
	bool ret = true;
	bool iter_static_rqs = !!(iter_data->flags & BT_TAG_ITER_STATIC_RQS);

	if (!reserved)
		bitnr += tags->nr_reserved_tags;

	/*
	 * We can hit rq == NULL here, because the tagging functions
	 * test and set the bit before assigning ->rqs[].
	 */
	if (iter_static_rqs)
		rq = tags->static_rqs[bitnr];
	else
		rq = blk_mq_find_and_get_req(tags, bitnr);
	if (!rq)
		return true;

	if (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
	    blk_mq_request_started(rq))
		ret = iter_data->fn(rq, iter_data->data, reserved);
	if (!iter_static_rqs)
		blk_mq_put_rq_ref(rq);
	return ret;
}

/**
 * bt_tags_for_each - iterate over the requests in a tag map
 * @tags:	Tag map to iterate over.
 * @bt:		sbitmap to examine. This is either the breserved_tags member
 *		or the bitmap_tags member of struct blk_mq_tags.
 * @fn:		Pointer to the function that will be called for each started
 *		request. @fn will be called as follows: @fn(rq, @data,
 *		@reserved) where rq is a pointer to a request. Return true
 *		to continue iterating tags, false to stop.
 * @data:	Will be passed as second argument to @fn.
 * @flags:	BT_TAG_ITER_*
 */
static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
			     busy_tag_iter_fn *fn, void *data, unsigned int flags)
{
	struct bt_tags_iter_data iter_data = {
		.tags = tags,
		.fn = fn,
		.data = data,
		.flags = flags,
	};

	if (tags->rqs)
		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
}

static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
		busy_tag_iter_fn *fn, void *priv, unsigned int flags)
{
	WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED);

	if (tags->nr_reserved_tags)
		bt_tags_for_each(tags, tags->breserved_tags, fn, priv,
				 flags | BT_TAG_ITER_RESERVED);
	bt_tags_for_each(tags, tags->bitmap_tags, fn, priv, flags);
}

/**
 * blk_mq_all_tag_iter - iterate over all requests in a tag map
 * @tags:	Tag map to iterate over.
 * @fn:		Pointer to the function that will be called for each
 *		request. @fn will be called as follows: @fn(rq, @priv,
 *		reserved) where rq is a pointer to a request. 'reserved'
 *		indicates whether or not @rq is a reserved request. Return
 *		true to continue iterating tags, false to stop.
 * @priv:	Will be passed as second argument to @fn.
 *
 * Caller has to pass the tag map from which requests are allocated.
 */
void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
		void *priv)
{
	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
}

/**
 * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
 * @tagset:	Tag set to iterate over.
 * @fn:		Pointer to the function that will be called for each started
 *		request. @fn will be called as follows: @fn(rq, @priv,
 *		reserved) where rq is a pointer to a request. 'reserved'
 *		indicates whether or not @rq is a reserved request. Return
 *		true to continue iterating tags, false to stop.
 * @priv:	Will be passed as second argument to @fn.
 *
 * We grab one request reference before calling @fn and release it after
 * @fn returns.
 */
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
		busy_tag_iter_fn *fn, void *priv)
{
	int i;

	for (i = 0; i < tagset->nr_hw_queues; i++) {
		if (tagset->tags && tagset->tags[i])
			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
					      BT_TAG_ITER_STARTED);
	}
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
		void *data, bool reserved)
{
	unsigned *count = data;

	if (blk_mq_request_completed(rq))
		(*count)++;
	return true;
}

/**
 * blk_mq_tagset_wait_completed_request - Wait until all scheduled request
 * completions have finished.
 * @tagset:	Tag set to drain completed request
 *
 * Note: This function has to be run after all IO queues are shutdown
 */
void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
{
	while (true) {
		unsigned count = 0;

		blk_mq_tagset_busy_iter(tagset,
				blk_mq_tagset_count_completed_rqs, &count);
		if (!count)
			break;
		msleep(5);
	}
}
EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);

/**
 * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
 * @q:		Request queue to examine.
 * @fn:		Pointer to the function that will be called for each request
 *		on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
 *		reserved) where rq is a pointer to a request and hctx points
 *		to the hardware queue associated with the request. 'reserved'
 *		indicates whether or not @rq is a reserved request.
 * @priv:	Will be passed as third argument to @fn.
 *
 * Note: if @q->tag_set is shared with other request queues then @fn will be
 * called for all requests on all queues that share that tag set and not only
 * for requests associated with @q.
 */
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
		void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/*
	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
	 * while the queue is frozen. So we can use q_usage_counter to avoid
	 * racing with it.
	 */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		struct blk_mq_tags *tags = hctx->tags;

		/*
		 * If no software queues are currently mapped to this
		 * hardware queue, there's nothing to check
		 */
		if (!blk_mq_hw_queue_mapped(hctx))
			continue;

		if (tags->nr_reserved_tags)
			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
	}
	blk_queue_exit(q);
}

static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
		    bool round_robin, int node)
{
	return sbitmap_queue_init_node(bt, depth, -1, round_robin, GFP_KERNEL,
				       node);
}

int blk_mq_init_bitmaps(struct sbitmap_queue *bitmap_tags,
			struct sbitmap_queue *breserved_tags,
			unsigned int queue_depth, unsigned int reserved,
			int node, int alloc_policy)
{
	unsigned int depth = queue_depth - reserved;
	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;

	if (bt_alloc(bitmap_tags, depth, round_robin, node))
		return -ENOMEM;
	if (bt_alloc(breserved_tags, reserved, round_robin, node))
		goto free_bitmap_tags;

	return 0;

free_bitmap_tags:
	sbitmap_queue_free(bitmap_tags);
	return -ENOMEM;
}
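
The two-step allocation with goto-based unwinding in blk_mq_init_bitmaps() follows a common C error-handling shape, sketched here in user space. Everything in this block is an assumption made for illustration: struct two_maps, alloc_map() and init_two_maps() are hypothetical, calloc() stands in for sbitmap_queue_init_node(), one word per tag is used only for simplicity, and the -EINVAL check for "reserved > queue_depth" is an addition that the kernel function does not perform.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical pair of maps: one for normal tags, one for reserved tags. */
struct two_maps {
	unsigned long *normal;
	unsigned long *reserved;
};

/* Allocate a zeroed map; a depth of 0 still gets a minimal allocation. */
static int alloc_map(unsigned long **map, unsigned int depth)
{
	*map = calloc(depth ? depth : 1, sizeof(**map));
	return *map ? 0 : -ENOMEM;
}

/* Initialise both maps, undoing the first if the second one fails. */
static int init_two_maps(struct two_maps *m, unsigned int queue_depth,
			 unsigned int reserved)
{
	unsigned int depth;

	/* Added guard (not in the kernel code): avoid unsigned underflow. */
	if (reserved > queue_depth)
		return -EINVAL;
	depth = queue_depth - reserved;

	if (alloc_map(&m->normal, depth))
		return -ENOMEM;
	if (alloc_map(&m->reserved, reserved))
		goto free_normal;

	return 0;

free_normal:
	free(m->normal);
	m->normal = NULL;
	return -ENOMEM;
}
```

Keeping the cleanup under a single label means each further resource would add one label and one goto, and the success path stays free of cleanup code.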

static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
				   int node, int alloc_policy)
{
	int ret;

	ret = blk_mq_init_bitmaps(&tags->__bitmap_tags,
				  &tags->__breserved_tags,
				  tags->nr_tags, tags->nr_reserved_tags,
				  node, alloc_policy);
	if (ret)
		return ret;

	tags->bitmap_tags = &tags->__bitmap_tags;
	tags->breserved_tags = &tags->__breserved_tags;

	return 0;
}

int blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *set)
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shut down. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 23:20:24 +08:00
|
|
|
{
|
|
|
|
int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags);
|
2021-05-13 20:00:57 +08:00
|
|
|
int i, ret;
|

	ret = blk_mq_init_bitmaps(&set->__bitmap_tags, &set->__breserved_tags,
				  set->queue_depth, set->reserved_tags,
				  set->numa_node, alloc_policy);
	if (ret)
		return ret;

	for (i = 0; i < set->nr_hw_queues; i++) {
		struct blk_mq_tags *tags = set->tags[i];

		tags->bitmap_tags = &set->__bitmap_tags;
		tags->breserved_tags = &set->__breserved_tags;
	}

	return 0;
}
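
The pointer rewiring done by blk_mq_init_shared_sbitmap() can be modelled in user space. The sketch below is not kernel code: the model_* types, the single-word bitmap and the linear tag search are assumptions made purely for illustration (the kernel uses struct sbitmap_queue). It only aims to show why pointing every hardware queue's bitmap_tags at storage owned by the tag set makes tags unique across the whole set, while private storage makes them unique per queue only.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_HW_QUEUES 4
#define MODEL_MAX_DEPTH 32u /* hypothetical limit: one word of tag bits */

/* Hypothetical stand-in for struct sbitmap_queue. */
struct model_bitmap {
	unsigned long word;  /* bit n set means tag n is in use */
	unsigned int depth;  /* number of usable tags */
};

/* Mirrors the pointer/storage split seen in struct blk_mq_tags. */
struct model_tags {
	struct model_bitmap *bitmap_tags;  /* what allocation goes through */
	struct model_bitmap __bitmap_tags; /* private storage, unused when shared */
};

struct model_tag_set {
	unsigned int nr_hw_queues;
	unsigned int queue_depth;
	struct model_bitmap __bitmap_tags; /* storage shared by all queues */
	struct model_tags tags[MODEL_MAX_HW_QUEUES];
};

/* Private mode: the pointer refers to the queue's own storage. */
static void model_init_private(struct model_tags *tags, unsigned int depth)
{
	assert(depth <= MODEL_MAX_DEPTH);
	tags->__bitmap_tags.word = 0;
	tags->__bitmap_tags.depth = depth;
	tags->bitmap_tags = &tags->__bitmap_tags;
}

/* Shared mode: every queue's pointer refers to the set's storage. */
static void model_init_shared(struct model_tag_set *set)
{
	unsigned int i;

	assert(set->nr_hw_queues <= MODEL_MAX_HW_QUEUES);
	assert(set->queue_depth <= MODEL_MAX_DEPTH);
	set->__bitmap_tags.word = 0;
	set->__bitmap_tags.depth = set->queue_depth;

	for (i = 0; i < set->nr_hw_queues; i++)
		set->tags[i].bitmap_tags = &set->__bitmap_tags;
}

/* Take the lowest free tag, or return -1 when the bitmap is exhausted. */
static int model_get_tag(struct model_tags *tags)
{
	struct model_bitmap *bm = tags->bitmap_tags;
	unsigned int bit;

	for (bit = 0; bit < bm->depth; bit++) {
		if (!(bm->word & (1UL << bit))) {
			bm->word |= 1UL << bit;
			return (int)bit;
		}
	}
	return -1;
}
```

In this model, two queues initialised with model_init_shared() and a depth of 2 can hand out two tags in total, whereas two queues initialised with model_init_private() can each hand out tag 0 independently.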

void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *set)
{
	sbitmap_queue_free(&set->__bitmap_tags);
	sbitmap_queue_free(&set->__breserved_tags);
}

struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
				     unsigned int reserved_tags,
				     int node, unsigned int flags)
{
	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(flags);
	struct blk_mq_tags *tags;

	if (total_tags > BLK_MQ_TAG_MAX) {
		pr_err("blk-mq: tag depth too large\n");
		return NULL;
	}

	tags = kzalloc_node(sizeof(*tags), GFP_KERNEL, node);
	if (!tags)
		return NULL;

	tags->nr_tags = total_tags;
	tags->nr_reserved_tags = reserved_tags;
	spin_lock_init(&tags->lock);

	if (blk_mq_is_sbitmap_shared(flags))
		return tags;

	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
		kfree(tags);
		return NULL;
	}
	return tags;
}
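
The control flow of blk_mq_init_tags() (reject an oversized depth, allocate a zeroed descriptor, return early when the bitmap is shared, otherwise set up a private bitmap and free the descriptor on failure) can be sketched outside the kernel. Everything named sketch_* or SKETCH_* below is hypothetical: calloc() stands in for kzalloc_node(), a byte array stands in for the sbitmap queues, and SKETCH_TAG_MAX is an arbitrary limit, not the value of BLK_MQ_TAG_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_TAG_MAX 1024u /* hypothetical limit for this sketch */

struct sketch_tags {
	unsigned int nr_tags;
	unsigned int nr_reserved_tags;
	unsigned char *bitmap; /* stays NULL when the bitmap is shared */
};

/* Stand-in for the private bitmap setup; returns 0 on success, -1 on failure. */
static int sketch_init_bitmap(struct sketch_tags *tags)
{
	/* Allocate at least one byte so a zero depth does not yield NULL. */
	tags->bitmap = calloc(tags->nr_tags ? tags->nr_tags : 1, 1);
	return tags->bitmap ? 0 : -1;
}

static struct sketch_tags *sketch_init_tags(unsigned int total_tags,
					    unsigned int reserved_tags,
					    bool shared)
{
	struct sketch_tags *tags;

	if (total_tags > SKETCH_TAG_MAX)
		return NULL;

	tags = calloc(1, sizeof(*tags));
	if (!tags)
		return NULL;

	tags->nr_tags = total_tags;
	tags->nr_reserved_tags = reserved_tags;

	/* Shared mode: the owner of the shared bitmap wires it up later. */
	if (shared)
		return tags;

	if (sketch_init_bitmap(tags) < 0) {
		free(tags); /* unwind the descriptor allocation */
		return NULL;
	}
	return tags;
}

/* Only a private bitmap belongs to the descriptor and is freed with it. */
static void sketch_free_tags(struct sketch_tags *tags, bool shared)
{
	assert(tags != NULL);
	if (!shared)
		free(tags->bitmap);
	free(tags);
}
```

The same shared/private flag has to be passed to both the init and the free helper; freeing with the wrong mode would either leak the private bitmap or skip nothing useful, which is why the kernel function takes the tag set flags on both paths.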

void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags)
{
	if (!blk_mq_is_sbitmap_shared(flags)) {
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 23:20:24 +08:00
|
|
|
sbitmap_queue_free(tags->bitmap_tags);
|
|
|
|
sbitmap_queue_free(tags->breserved_tags);
|
|
|
|
}
|
2013-10-24 16:20:05 +08:00
|
|
|
kfree(tags);
|
|
|
|
}
|
|
|
|
|
2017-01-20 01:59:07 +08:00
|
|
|
int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_tags **tagsptr, unsigned int tdepth,
|
|
|
|
bool can_grow)
|
2014-05-21 01:49:02 +08:00
|
|
|
{
|
2017-01-20 01:59:07 +08:00
|
|
|
struct blk_mq_tags *tags = *tagsptr;
|
|
|
|
|
|
|
|
if (tdepth <= tags->nr_reserved_tags)
|
2014-05-21 01:49:02 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
2017-01-20 01:59:07 +08:00
|
|
|
* If we are allowed to grow beyond the original size, allocate
|
|
|
|
* a new set of tags before freeing the old one.
|
2014-05-21 01:49:02 +08:00
|
|
|
*/
|
2017-01-20 01:59:07 +08:00
|
|
|
if (tdepth > tags->nr_tags) {
|
|
|
|
struct blk_mq_tag_set *set = hctx->queue->tag_set;
|
|
|
|
struct blk_mq_tags *new;
|
|
|
|
|
|
|
|
if (!can_grow)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need some sort of upper limit, set it high enough that
|
|
|
|
* no valid use cases should require more.
|
|
|
|
*/
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with swapping out the old sbitmap for a new one if we need to
grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 20:00:58 +08:00
|
|
|
if (tdepth > MAX_SCHED_RQ)
|
2017-01-20 01:59:07 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2021-10-05 18:23:35 +08:00
|
|
|
new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
|
2017-01-20 01:59:07 +08:00
|
|
|
if (!new)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
blk_mq_free_rqs(set, *tagsptr, hctx->queue_num);
|
2021-05-13 20:00:58 +08:00
|
|
|
blk_mq_free_rq_map(*tagsptr, set->flags);
|
2017-01-20 01:59:07 +08:00
|
|
|
*tagsptr = new;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Don't need (or can't) update reserved tags here, they
|
|
|
|
* remain static and should never need resizing.
|
|
|
|
*/
|
2020-08-19 23:20:23 +08:00
|
|
|
sbitmap_queue_resize(tags->bitmap_tags,
|
2018-08-02 18:23:26 +08:00
|
|
|
tdepth - tags->nr_reserved_tags);
|
2017-01-20 01:59:07 +08:00
|
|
|
}
|
2016-09-17 22:38:44 +08:00
|
|
|
|
2014-05-21 01:49:02 +08:00
|
|
|
return 0;
|
|
|
|
}
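The branch structure of blk_mq_tag_update_depth() can be summarised with a small userspace sketch. This is not the kernel function: sketch_update_depth(), the SKETCH_* names, and the numeric cap are hypothetical stand-ins introduced for illustration, and the allocation step (which can fail with -ENOMEM in the real code) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative cap only (assumption); the kernel compares against MAX_SCHED_RQ. */
#define SKETCH_MAX_SCHED_RQ 2048u

/* Hypothetical result codes describing which path the real function would take. */
enum {
	SKETCH_RESIZE_IN_PLACE = 0,	/* sbitmap_queue_resize() on the existing map */
	SKETCH_REALLOCATE = 1,		/* allocate a new map, then free the old one */
};

/*
 * Mirrors only the decision logic: returns a negative errno for a rejected
 * depth, otherwise one of the SKETCH_* codes above.
 */
static int sketch_update_depth(unsigned int nr_tags,
			       unsigned int nr_reserved_tags,
			       unsigned int tdepth, bool can_grow)
{
	assert(nr_reserved_tags <= nr_tags);

	/* The new depth must leave at least one non-reserved tag. */
	if (tdepth <= nr_reserved_tags)
		return -EINVAL;

	if (tdepth > nr_tags) {
		/* Growing past the original size needs a fresh tag map. */
		if (!can_grow)
			return -EINVAL;
		if (tdepth > SKETCH_MAX_SCHED_RQ)
			return -EINVAL;
		return SKETCH_REALLOCATE;
	}

	/* Shrinking, or staying within the original size, resizes in place. */
	return SKETCH_RESIZE_IN_PLACE;
}
```

With an existing map of 64 tags and 2 reserved, a request for depth 32 takes the in-place path, while a request for depth 128 is accepted only when can_grow is true.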
|
|
|
|
|
2020-08-19 23:20:24 +08:00
|
|
|
void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
|
|
|
|
{
|
|
|
|
sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
|
|
|
|
}
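The arithmetic behind the shared sbitmap can be made concrete with a rough userspace sketch. All sketch_* helpers below are hypothetical and only model the counting: with a private tag map per hctx the block layer may have up to can_queue * nr_hw_queues commands in flight, whereas a single shared map bounds the total at can_queue. The last helper mirrors the "size - reserved_tags" depth handed to sbitmap_queue_resize() above.

```c
#include <assert.h>

/* Upper bound on in-flight commands when every hctx owns a private tag map. */
static unsigned int sketch_max_inflight_per_hctx(unsigned int can_queue,
						 unsigned int nr_hw_queues)
{
	return can_queue * nr_hw_queues;
}

/* Upper bound when all hctxs draw from one shared map of can_queue tags. */
static unsigned int sketch_max_inflight_shared(unsigned int can_queue,
					       unsigned int nr_hw_queues)
{
	(void)nr_hw_queues;	/* the queue count no longer multiplies the limit */
	return can_queue;
}

/* Non-reserved depth that the shared bitmap is resized to. */
static unsigned int sketch_shared_resize_depth(unsigned int size,
					       unsigned int reserved_tags)
{
	assert(size > reserved_tags);
	return size - reserved_tags;
}
```

For a host with can_queue of 1024 and 16 hardware queues, the per-hctx bound is 16384 commands while the shared bound stays at 1024.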
|
|
|
|
|
2021-10-05 18:23:34 +08:00
|
|
|
void blk_mq_tag_update_sched_shared_sbitmap(struct request_queue *q)
|
|
|
|
{
|
|
|
|
sbitmap_queue_resize(&q->sched_bitmap_tags,
|
|
|
|
q->nr_requests - q->tag_set->reserved_tags);
|
|
|
|
}
|
|
|
|
|
2014-10-30 21:45:11 +08:00
|
|
|
/**
|
|
|
|
* blk_mq_unique_tag() - return a tag that is unique queue-wide
|
|
|
|
* @rq: request for which to compute a unique tag
|
|
|
|
*
|
|
|
|
* The tag field in struct request is unique per hardware queue but not over
|
|
|
|
* all hardware queues. Hence this function that returns a tag with the
|
|
|
|
* hardware context index in the upper bits and the per hardware queue tag in
|
|
|
|
* the lower bits.
|
|
|
|
*
|
|
|
|
* Note: When called for a request that is queued on a non-multiqueue request
|
|
|
|
* queue, the hardware context index is set to zero.
|
|
|
|
*/
|
|
|
|
u32 blk_mq_unique_tag(struct request *rq)
|
|
|
|
{
|
2018-10-30 05:06:13 +08:00
|
|
|
return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
|
2014-10-30 21:45:11 +08:00
|
|
|
(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_unique_tag);
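The packing done by blk_mq_unique_tag() can be tried out in userspace with the minimal sketch below. The two macros are redefined locally with the values used in include/linux/blk-mq.h so that the example builds on its own; the sketch_* functions are hypothetical stand-ins that take the hardware queue index and tag directly instead of a struct request. The two decoders correspond to the kernel helpers blk_mq_unique_tag_to_hwq() and blk_mq_unique_tag_to_tag().

```c
#include <assert.h>
#include <stdint.h>

/* Redefined locally for a standalone build; values mirror include/linux/blk-mq.h. */
#define BLK_MQ_UNIQUE_TAG_BITS 16
#define BLK_MQ_UNIQUE_TAG_MASK ((1U << BLK_MQ_UNIQUE_TAG_BITS) - 1)

/* Hardware queue index in the upper 16 bits, per-queue tag in the lower 16. */
static uint32_t sketch_unique_tag(uint32_t hwq, uint32_t tag)
{
	assert(hwq <= UINT16_MAX);	/* the index must fit in the upper half */
	return (hwq << BLK_MQ_UNIQUE_TAG_BITS) | (tag & BLK_MQ_UNIQUE_TAG_MASK);
}

/* Recover the hardware queue index from a unique tag. */
static uint16_t sketch_unique_tag_to_hwq(uint32_t unique_tag)
{
	return (uint16_t)(unique_tag >> BLK_MQ_UNIQUE_TAG_BITS);
}

/* Recover the per-hardware-queue tag from a unique tag. */
static uint16_t sketch_unique_tag_to_tag(uint32_t unique_tag)
{
	return (uint16_t)(unique_tag & BLK_MQ_UNIQUE_TAG_MASK);
}
```

For example, hardware queue 3 with tag 42 encodes to 0x0003002A, and both fields are recovered unchanged by the two decoders.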
|