// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 1991, 1992 Linus Torvalds
 * Copyright (C) 1994, Karl Keyte: Added support for disk statistics
 * Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
 * Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
 * kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
 *	- July2000
 * bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
 */

/*
 * This handles all read/write requests to block devices
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/kernel_stat.h>
#include <linux/string.h>
#include <linux/init.h>
#include <linux/completion.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/fault-inject.h>
#include <linux/list_sort.h>
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
#include <linux/blk-cgroup.h>
#include <linux/t10-pi.h>
#include <linux/debugfs.h>
#include <linux/bpf.h>
#include <linux/psi.h>

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>

#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
#include "blk-pm.h"
#include "blk-rq-qos.h"

#ifdef CONFIG_DEBUG_FS
struct dentry *blk_debugfs_root;
#endif

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_split);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);

DEFINE_IDA(blk_queue_ida);

/*
 * For queue allocation
 */
struct kmem_cache *blk_requestq_cachep;

/*
 * Controlling structure to kblockd
 */
static struct workqueue_struct *kblockd_workqueue;

/**
 * blk_queue_flag_set - atomically set a queue flag
 * @flag: flag to be set
 * @q: request queue
 */
void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
{
	set_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL(blk_queue_flag_set);

/**
 * blk_queue_flag_clear - atomically clear a queue flag
 * @flag: flag to be cleared
 * @q: request queue
 */
void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
{
	clear_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL(blk_queue_flag_clear);

/**
 * blk_queue_flag_test_and_set - atomically test and set a queue flag
 * @flag: flag to be set
 * @q: request queue
 *
 * Returns the previous value of @flag - 0 if the flag was not set and 1 if
 * the flag was already set.
 */
bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
{
	return test_and_set_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
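
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the same set / clear / test-and-set semantics, assuming C11
 * <stdatomic.h> in place of the kernel's set_bit() family. The names
 * demo_queue, demo_mask and demo_flag_* are hypothetical and exist only
 * for this example.
 */

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the queue_flags word of struct request_queue. */
struct demo_queue {
	atomic_ulong flags;
};

/* Build the single-bit mask for @flag, rejecting out-of-range bit numbers. */
static unsigned long demo_mask(unsigned int flag)
{
	assert(flag < sizeof(unsigned long) * CHAR_BIT);
	return 1UL << flag;
}

/* Atomically set bit @flag; concurrent updates to other bits are not lost. */
static void demo_flag_set(unsigned int flag, struct demo_queue *q)
{
	atomic_fetch_or(&q->flags, demo_mask(flag));
}

/* Atomically clear bit @flag. */
static void demo_flag_clear(unsigned int flag, struct demo_queue *q)
{
	atomic_fetch_and(&q->flags, ~demo_mask(flag));
}

/* Atomically set bit @flag and report whether it was already set. */
static bool demo_flag_test_and_set(unsigned int flag, struct demo_queue *q)
{
	unsigned long old = atomic_fetch_or(&q->flags, demo_mask(flag));

	return (old & demo_mask(flag)) != 0;
}
```

/*
 * The point of the read-modify-write primitives is that two callers racing
 * on demo_flag_test_and_set() cannot both observe "was not set"; a plain
 * load followed by a store would not give that guarantee.
 */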

void blk_rq_init(struct request_queue *q, struct request *rq)
{
	memset(rq, 0, sizeof(*rq));

	INIT_LIST_HEAD(&rq->queuelist);
	rq->q = q;
	rq->__sector = (sector_t) -1;
	INIT_HLIST_NODE(&rq->hash);
	RB_CLEAR_NODE(&rq->rb_node);
	rq->tag = -1;
	rq->internal_tag = -1;
	rq->start_time_ns = ktime_get_ns();
	rq->part = NULL;
	refcount_set(&rq->ref, 1);
}
EXPORT_SYMBOL(blk_rq_init);

#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
static const char *const blk_op_name[] = {
	REQ_OP_NAME(READ),
	REQ_OP_NAME(WRITE),
	REQ_OP_NAME(FLUSH),
	REQ_OP_NAME(DISCARD),
	REQ_OP_NAME(SECURE_ERASE),
	REQ_OP_NAME(ZONE_RESET),
	REQ_OP_NAME(ZONE_RESET_ALL),
	REQ_OP_NAME(ZONE_OPEN),
	REQ_OP_NAME(ZONE_CLOSE),
	REQ_OP_NAME(ZONE_FINISH),
	REQ_OP_NAME(WRITE_SAME),
	REQ_OP_NAME(WRITE_ZEROES),
	REQ_OP_NAME(SCSI_IN),
	REQ_OP_NAME(SCSI_OUT),
	REQ_OP_NAME(DRV_IN),
	REQ_OP_NAME(DRV_OUT),
};
#undef REQ_OP_NAME

/**
 * blk_op_str - Return the string XXX for a REQ_OP_XXX.
 * @op: REQ_OP_XXX.
 *
 * Description: Centralized block layer function that converts REQ_OP_XXX
 * into string format. Useful when debugging and tracing a bio or request.
 * For an invalid REQ_OP_XXX it returns the string "UNKNOWN".
 */
inline const char *blk_op_str(unsigned int op)
{
	const char *op_str = "UNKNOWN";

	if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
		op_str = blk_op_name[op];

	return op_str;
}
EXPORT_SYMBOL_GPL(blk_op_str);
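
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the table technique used by blk_op_name[] and blk_op_str(),
 * i.e. designated initializers combined with the preprocessor's '#'
 * stringification. The enum values and every demo_ / DEMO_ name are
 * hypothetical; the gap between FLUSH and DISCARD is chosen to show why
 * the lookup needs a NULL check as well as a bounds check.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode numbering; values 3 and 4 are deliberately unused. */
enum demo_op {
	DEMO_OP_READ = 0,
	DEMO_OP_WRITE = 1,
	DEMO_OP_FLUSH = 2,
	DEMO_OP_DISCARD = 5,
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Expands to e.g. [DEMO_OP_READ] = "READ": index from the enum, text from #name. */
#define DEMO_OP_NAME(name) [DEMO_OP_##name] = #name
static const char *const demo_op_name[] = {
	DEMO_OP_NAME(READ),
	DEMO_OP_NAME(WRITE),
	DEMO_OP_NAME(FLUSH),
	DEMO_OP_NAME(DISCARD),
};
#undef DEMO_OP_NAME

/* The array is sized by its largest designated index, not by its entry count. */
static_assert(sizeof(demo_op_name) / sizeof(demo_op_name[0]) ==
	      DEMO_OP_DISCARD + 1, "table must cover the largest opcode");

/* Slots skipped by the initializer are NULL, so both checks are required. */
static const char *demo_op_str(unsigned int op)
{
	if (op < DEMO_ARRAY_SIZE(demo_op_name) && demo_op_name[op])
		return demo_op_name[op];
	return "UNKNOWN";
}
```

/*
 * With this layout, adding an opcode is a one-line change to the table, and
 * an opcode that was never added degrades to "UNKNOWN" instead of reading
 * past the array or dereferencing NULL.
 */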

static const struct {
	int		errno;
	const char	*name;
} blk_errors[] = {
	[BLK_STS_OK]		= { 0,		"" },
	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
	[BLK_STS_NEXUS]		= { -EBADE,	"critical nexus" },
	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },

	/* device mapper special case, should not leak out: */
	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },

	/* everything else not covered above: */
	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
};

blk_status_t errno_to_blk_status(int errno)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
		if (blk_errors[i].errno == errno)
			return (__force blk_status_t)i;
	}

	return BLK_STS_IOERR;
}
EXPORT_SYMBOL_GPL(errno_to_blk_status);

int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return -EIO;
	return blk_errors[idx].errno;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);
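
/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the two-way mapping above. The status code is the table index,
 * so status -> errno is a bounds-checked array read, while errno -> status
 * is a linear scan that falls back to the generic I/O error. The demo_
 * names and the reduced set of codes are hypothetical, and the struct
 * member is called 'err' because 'errno' is a macro in userspace.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical compact status codes, used directly as table indices. */
enum demo_status {
	DEMO_STS_OK = 0,
	DEMO_STS_NOTSUPP,
	DEMO_STS_TIMEOUT,
	DEMO_STS_NOSPC,
	DEMO_STS_IOERR,		/* catch-all for anything not listed */
	DEMO_STS_COUNT
};

static const struct {
	int err;
	const char *name;
} demo_errors[] = {
	[DEMO_STS_OK]		= { 0,			"" },
	[DEMO_STS_NOTSUPP]	= { -EOPNOTSUPP,	"operation not supported" },
	[DEMO_STS_TIMEOUT]	= { -ETIMEDOUT,		"timeout" },
	[DEMO_STS_NOSPC]	= { -ENOSPC,		"critical space allocation" },
	[DEMO_STS_IOERR]	= { -EIO,		"I/O" },
};

static_assert(sizeof(demo_errors) / sizeof(demo_errors[0]) == DEMO_STS_COUNT,
	      "every status code needs a table entry");

/* Linear scan: the table is tiny and this is only hit on error paths. */
static enum demo_status demo_errno_to_status(int err)
{
	size_t i;

	for (i = 0; i < DEMO_STS_COUNT; i++) {
		if (demo_errors[i].err == err)
			return (enum demo_status)i;
	}

	return DEMO_STS_IOERR;
}

/* Direct index, guarded against status values outside the table. */
static int demo_status_to_errno(enum demo_status status)
{
	if ((size_t)status >= DEMO_STS_COUNT)
		return -EIO;
	return demo_errors[status].err;
}
```

/*
 * Note that the mapping is lossy in one direction: an errno value without
 * a table entry, such as -EINVAL here, becomes the catch-all status and
 * converts back to -EIO rather than to the original value.
 */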

static void print_req_error(struct request *req, blk_status_t status,
		const char *caller)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return;

	printk_ratelimited(KERN_ERR
		"%s: %s error, dev %s, sector %llu op 0x%x:(%s) flags 0x%x "
		"phys_seg %u prio class %u\n",
		caller, blk_errors[idx].name,
		req->rq_disk ? req->rq_disk->disk_name : "?",
		blk_rq_pos(req), req_op(req), blk_op_str(req_op(req)),
		req->cmd_flags & ~REQ_OP_MASK,
		req->nr_phys_segments,
		IOPRIO_PRIO_CLASS(req->ioprio));
}

static void req_bio_endio(struct request *rq, struct bio *bio,
			  unsigned int nbytes, blk_status_t error)
{
	if (error)
		bio->bi_status = error;

	if (unlikely(rq->rq_flags & RQF_QUIET))
		bio_set_flag(bio, BIO_QUIET);

	bio_advance(bio, nbytes);

	/* don't actually finish bio if it's part of flush sequence */
	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
		bio_endio(bio);
}

void blk_dump_rq_flags(struct request *rq, char *msg)
{
	printk(KERN_INFO "%s: dev %s: flags=%llx\n", msg,
		rq->rq_disk ? rq->rq_disk->disk_name : "?",
		(unsigned long long) rq->cmd_flags);

	printk(KERN_INFO "  sector %llu, nr/cnr %u/%u\n",
	       (unsigned long long)blk_rq_pos(rq),
	       blk_rq_sectors(rq), blk_rq_cur_sectors(rq));
	printk(KERN_INFO "  bio %p, biotail %p, len %u\n",
	       rq->bio, rq->biotail, blk_rq_bytes(rq));
}
EXPORT_SYMBOL(blk_dump_rq_flags);

/**
 * blk_sync_queue - cancel any pending callbacks on a queue
 * @q: the queue
 *
 * Description:
 *     The block layer may perform asynchronous callback activity
 *     on a queue, such as calling the unplug function after a timeout.
 *     A block device may call blk_sync_queue to ensure that any
 *     such activity is cancelled, thus allowing it to release resources
 *     that the callbacks might use. The caller must already have made sure
 *     that its ->make_request_fn will not re-add plugging prior to calling
 *     this function.
 *
 *     This function does not cancel any asynchronous activity arising
 *     out of elevator or throttling code. That would require elevator_exit()
 *     and blkcg_exit_queue() to be called with queue lock initialized.
 *
 */
void blk_sync_queue(struct request_queue *q)
{
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
}
EXPORT_SYMBOL(blk_sync_queue);

/**
 * blk_set_pm_only - increment pm_only counter
 * @q: request queue pointer
 */
void blk_set_pm_only(struct request_queue *q)
{
	atomic_inc(&q->pm_only);
}
EXPORT_SYMBOL_GPL(blk_set_pm_only);

void blk_clear_pm_only(struct request_queue *q)
{
	int pm_only;

	pm_only = atomic_dec_return(&q->pm_only);
	WARN_ON_ONCE(pm_only < 0);
	if (pm_only == 0)
		wake_up_all(&q->mq_freeze_wq);
}
EXPORT_SYMBOL_GPL(blk_clear_pm_only);

void blk_put_queue(struct request_queue *q)
{
	kobject_put(&q->kobj);
}
EXPORT_SYMBOL(blk_put_queue);

void blk_set_queue_dying(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_DYING, q);

	/*
	 * When queue DYING flag is set, we need to block new req
	 * entering queue, so we call blk_freeze_queue_start() to
	 * prevent I/O from crossing blk_queue_enter().
	 */
	blk_freeze_queue_start(q);

	if (queue_is_mq(q))
		blk_mq_wake_waiters(q);

	/* Make blk_queue_enter() reexamine the DYING flag. */
	wake_up_all(&q->mq_freeze_wq);
}
EXPORT_SYMBOL_GPL(blk_set_queue_dying);

/**
 * blk_cleanup_queue - shutdown a request queue
 * @q: request queue to shutdown
 *
 * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
 * put it.  All future requests will be failed immediately with -ENODEV.
 */
void blk_cleanup_queue(struct request_queue *q)
{
	WARN_ON_ONCE(blk_queue_registered(q));

	/* mark @q DYING, no new request or merges will be allowed afterwards */
	blk_set_queue_dying(q);

	blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
	blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);

	/*
	 * Drain all requests queued before DYING marking. Set the DEAD flag
	 * to prevent blk_mq_run_hw_queues() from accessing the hardware
	 * queues after draining has finished.
	 */
	blk_freeze_queue(q);

	rq_qos_exit(q);

	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.
This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.
With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.
sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:
Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[<ffffffff8137d774>] elv_merge+0x84/0xe0
[<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
[<ffffffff813838ea>] generic_make_request+0xca/0x100
[<ffffffff81383994>] submit_bio+0x74/0x100
[<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
[<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
[<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
[<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff8118c1ca>] do_sync_read+0xda/0x120
[<ffffffff8118ce55>] vfs_read+0xc5/0x180
[<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
[<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.
Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.
The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immediately failed. The rest of teardown
happens on release.
This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.
* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.
* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.
* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.
* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
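The shutdown/release split described in the commit message above can be sketched in plain userspace C. This is only an illustrative model under stated assumptions: `struct toy_queue` and every `toy_queue_*` function are hypothetical names, not kernel APIs, and draining is simulated by completing requests in a loop rather than by waiting for real I/O.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct toy_queue {
	int refcount;   /* references held by the owner and by other users */
	int in_flight;  /* requests issued but not yet completed */
	bool dead;      /* set by shutdown; later submissions fail */
	bool released;  /* set when the last reference is dropped */
};

void toy_queue_init(struct toy_queue *q)
{
	q->refcount = 1;        /* the owner's reference */
	q->in_flight = 0;
	q->dead = false;
	q->released = false;
}

void toy_queue_get(struct toy_queue *q)
{
	q->refcount++;
}

/* Release: the rest of the teardown happens on the final put. */
void toy_queue_put(struct toy_queue *q)
{
	if (--q->refcount == 0)
		q->released = true;
}

/* Reference holders may still submit after shutdown; the request fails. */
int toy_queue_submit(struct toy_queue *q)
{
	if (q->dead)
		return -ENODEV;
	q->in_flight++;
	return 0;
}

void toy_queue_complete(struct toy_queue *q)
{
	q->in_flight--;
}

/* Shutdown: drain pending requests, mark dead, drop the owner's reference. */
void toy_queue_shutdown(struct toy_queue *q)
{
	while (q->in_flight > 0)
		toy_queue_complete(q);  /* stand-in for waiting on completions */
	q->dead = true;
	toy_queue_put(q);
}
```

In this model a second reference holder keeps the object alive after shutdown, sees `-ENODEV` on any further submission, and triggers the release only when it drops its own reference.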
2011-10-19 20:42:16 +08:00
|
|
|
|
2015-10-22 01:20:23 +08:00
|
|
|
/* for synchronous bio-based driver finish in-flight integrity i/o */
|
|
|
|
blk_flush_integrity();
|
|
|
|
|
2011-10-19 20:42:16 +08:00
|
|
|
/* @q won't process any more request, flush async actions */
|
2017-02-02 22:56:50 +08:00
|
|
|
del_timer_sync(&q->backing_dev_info->laptop_mode_wb_timer);
|
2011-10-19 20:42:16 +08:00
|
|
|
blk_sync_queue(q);
|
|
|
|
|
2018-11-16 03:22:51 +08:00
|
|
|
if (queue_is_mq(q))
|
blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
But these still can't cover all cases; James recently reported another
issue of this kind:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2
This issue is hard to address with the previous approaches, given that
scsi_run_queue() may run requeues for other LUNs.
Fix the above issue by freeing hctx's resources in its release handler. This
is safe because tags aren't needed for freeing such hctx resources.
This approach follows the typical design pattern for a kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:52:25 +08:00
|
|
|
blk_mq_exit_queue(q);
|
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-30 00:23:51 +08:00
|
|
|
|
block: free sched's request pool in blk_cleanup_queue
In theory, IO scheduler belongs to request queue, and the request pool
of sched tags belongs to the request queue too.
However, the current tags allocation interfaces are re-used for both
driver tags and sched tags, and driver tags is definitely host wide,
and doesn't belong to any request queue, same with its request pool.
So we need tagset instance for freeing request of sched tags.
Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in the
non-BLK_MQ_F_TAG_SHARED case, which requires the request pool of sched
tags to be freed before calling blk_mq_free_tag_set().
Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
moves blk_exit_queue into __blk_release_queue to simplify the fast
path in generic_make_request(), which then causes an oops while freeing requests
of sched tags in __blk_release_queue().
Fix the above issue by moving the freeing of the sched tags' request pool into
blk_cleanup_queue(). This is safe because the queue has been frozen and there are
no in-queue requests at that time. Freeing sched tags has to be kept in the queue's
release handler because there might be un-completed dispatch activity
which might refer to sched tags.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-04 21:08:02 +08:00
|
|
|
/*
|
|
|
|
* In theory, request pool of sched_tags belongs to request queue.
|
|
|
|
* However, the current implementation requires tag_set for freeing
|
|
|
|
* requests, so free the pool now.
|
|
|
|
*
|
|
|
|
* Queue has become frozen, there can't be any in-queue requests, so
|
|
|
|
* it is safe to free requests now.
|
|
|
|
*/
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
if (q->elevator)
|
|
|
|
blk_mq_sched_free_requests(q);
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
|
2015-10-22 01:20:12 +08:00
|
|
|
percpu_ref_exit(&q->q_usage_counter);
|
2014-12-09 23:57:48 +08:00
|
|
|
|
2011-10-19 20:42:16 +08:00
|
|
|
/* @q is and will stay empty, shutdown and put */
|
2006-03-19 07:34:37 +08:00
|
|
|
blk_put_queue(q);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
EXPORT_SYMBOL(blk_cleanup_queue);
|
|
|
|
|
2017-11-10 02:49:58 +08:00
|
|
|
/**
|
|
|
|
* blk_queue_enter() - try to increase q->q_usage_counter
|
|
|
|
* @q: request queue pointer
|
|
|
|
* @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
|
|
|
|
*/
|
2017-11-10 02:49:59 +08:00
|
|
|
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
|
2015-10-22 01:20:12 +08:00
|
|
|
{
|
2018-09-27 05:01:04 +08:00
|
|
|
const bool pm = flags & BLK_MQ_REQ_PREEMPT;
|
2017-11-10 02:49:58 +08:00
|
|
|
|
2015-10-22 01:20:12 +08:00
|
|
|
while (true) {
|
2017-11-10 02:49:58 +08:00
|
|
|
bool success = false;
|
2015-10-22 01:20:12 +08:00
|
|
|
|
2018-03-20 02:46:13 +08:00
|
|
|
rcu_read_lock();
|
2017-11-10 02:49:58 +08:00
|
|
|
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
|
|
|
|
/*
|
2018-09-27 05:01:04 +08:00
|
|
|
* The code that increments the pm_only counter is
|
|
|
|
* responsible for ensuring that that counter is
|
|
|
|
* globally visible before the queue is unfrozen.
|
2017-11-10 02:49:58 +08:00
|
|
|
*/
|
2018-09-27 05:01:04 +08:00
|
|
|
if (pm || !blk_queue_pm_only(q)) {
|
2017-11-10 02:49:58 +08:00
|
|
|
success = true;
|
|
|
|
} else {
|
|
|
|
percpu_ref_put(&q->q_usage_counter);
|
|
|
|
}
|
|
|
|
}
|
2018-03-20 02:46:13 +08:00
|
|
|
rcu_read_unlock();
|
2017-11-10 02:49:58 +08:00
|
|
|
|
|
|
|
if (success)
|
2015-10-22 01:20:12 +08:00
|
|
|
return 0;
|
|
|
|
|
2017-11-10 02:49:58 +08:00
|
|
|
if (flags & BLK_MQ_REQ_NOWAIT)
|
2015-10-22 01:20:12 +08:00
|
|
|
return -EBUSY;
|
|
|
|
|
2017-03-27 20:06:56 +08:00
|
|
|
/*
|
2017-03-27 20:06:57 +08:00
|
|
|
* read pair of barrier in blk_freeze_queue_start(),
|
2017-03-27 20:06:56 +08:00
|
|
|
* we need to order reading __PERCPU_REF_DEAD flag of
|
2017-03-27 20:06:58 +08:00
|
|
|
* .q_usage_counter and reading .mq_freeze_depth or
|
|
|
|
* queue dying flag, otherwise the following wait may
|
|
|
|
* never return if the two reads are reordered.
|
2017-03-27 20:06:56 +08:00
|
|
|
*/
|
|
|
|
smp_rmb();
|
|
|
|
|
2018-04-13 02:11:58 +08:00
|
|
|
wait_event(q->mq_freeze_wq,
|
2019-05-21 11:25:55 +08:00
|
|
|
(!q->mq_freeze_depth &&
|
2018-09-27 05:01:06 +08:00
|
|
|
(pm || (blk_pm_request_resume(q),
|
|
|
|
!blk_queue_pm_only(q)))) ||
|
2018-04-13 02:11:58 +08:00
|
|
|
blk_queue_dying(q));
|
2015-10-22 01:20:12 +08:00
|
|
|
if (blk_queue_dying(q))
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void blk_queue_exit(struct request_queue *q)
|
|
|
|
{
|
|
|
|
percpu_ref_put(&q->q_usage_counter);
|
|
|
|
}
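The entry and exit logic of blk_queue_enter() and blk_queue_exit() above can be approximated by a small single-threaded model. This is a sketch under loud assumptions: `struct toy_gate` and the `toy_gate_*` functions are hypothetical, the pm_only handling, RCU section and memory barrier are left out, and because the sketch cannot sleep it returns `-EAGAIN` at the point where the kernel would block in wait_event().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the queue usage gate. */
struct toy_gate {
	int users;         /* stands in for q->q_usage_counter */
	int freeze_depth;  /* stands in for q->mq_freeze_depth */
	bool dying;        /* stands in for blk_queue_dying(q) */
};

/*
 * Try to enter the gate.
 * Returns 0 on success, -EBUSY if nowait is set and entry is not possible,
 * -ENODEV if the gate is dying, and -EAGAIN (an assumption of this sketch)
 * where the kernel would sleep until the queue is unfrozen.
 */
int toy_gate_enter(struct toy_gate *g, bool nowait)
{
	/* Models percpu_ref_tryget_live(): fails while frozen or dying. */
	if (g->freeze_depth == 0 && !g->dying) {
		g->users++;
		return 0;
	}

	/* Like BLK_MQ_REQ_NOWAIT: never wait, report busy instead. */
	if (nowait)
		return -EBUSY;

	/* After the (omitted) wait, a dying queue is reported as gone. */
	if (g->dying)
		return -ENODEV;

	return -EAGAIN;  /* the kernel would loop and retry after waking */
}

/* Models blk_queue_exit(): drop the usage reference. */
void toy_gate_exit(struct toy_gate *g)
{
	g->users--;
}
```

Note that, as in the original function, the nowait check comes before the dying check, so a non-blocking caller sees `-EBUSY` rather than `-ENODEV` whenever the try-get fails.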
|
|
|
|
|
|
|
|
static void blk_queue_usage_counter_release(struct percpu_ref *ref)
|
|
|
|
{
|
|
|
|
struct request_queue *q =
|
|
|
|
container_of(ref, struct request_queue, q_usage_counter);
|
|
|
|
|
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
|
|
|
}
|
|
|
|
|
2017-08-29 06:03:41 +08:00
|
|
|
static void blk_rq_timed_out_timer(struct timer_list *t)
|
2015-10-30 20:57:30 +08:00
|
|
|
{
|
2017-08-29 06:03:41 +08:00
|
|
|
struct request_queue *q = from_timer(q, t, timeout);
|
2015-10-30 20:57:30 +08:00
|
|
|
|
|
|
|
kblockd_schedule_work(&q->timeout_work);
|
|
|
|
}
|
|
|
|
|
2019-01-30 21:21:45 +08:00
|
|
|
static void blk_timeout_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2020-03-27 16:30:11 +08:00
|
|
|
struct request_queue *__blk_alloc_queue(int node_id)
|
2005-06-23 15:08:19 +08:00
|
|
|
{
|
2007-07-24 15:28:11 +08:00
|
|
|
struct request_queue *q;
|
2018-05-21 06:25:47 +08:00
|
|
|
int ret;
|
2005-06-23 15:08:19 +08:00
|
|
|
|
2008-01-29 21:51:59 +08:00
|
|
|
q = kmem_cache_alloc_node(blk_requestq_cachep,
|
2020-03-27 16:30:11 +08:00
|
|
|
GFP_KERNEL | __GFP_ZERO, node_id);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!q)
|
|
|
|
return NULL;
|
|
|
|
|
2018-06-01 01:11:36 +08:00
|
|
|
q->last_merge = NULL;
|
|
|
|
|
2020-03-27 16:30:11 +08:00
|
|
|
q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
|
2011-12-14 07:33:37 +08:00
|
|
|
if (q->id < 0)
|
2014-05-27 23:35:14 +08:00
|
|
|
goto fail_q;
|
2011-12-14 07:33:37 +08:00
|
|
|
|
2018-05-21 06:25:47 +08:00
|
|
|
ret = bioset_init(&q->bio_split, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
|
|
|
|
if (ret)
|
2015-04-24 13:37:18 +08:00
|
|
|
goto fail_id;
|
|
|
|
|
2020-03-27 16:30:11 +08:00
|
|
|
q->backing_dev_info = bdi_alloc_node(GFP_KERNEL, node_id);
|
2017-02-02 22:56:51 +08:00
|
|
|
if (!q->backing_dev_info)
|
|
|
|
goto fail_split;
|
|
|
|
|
2017-03-22 07:20:01 +08:00
|
|
|
q->stats = blk_alloc_queue_stats();
|
|
|
|
if (!q->stats)
|
|
|
|
goto fail_stats;
|
|
|
|
|
2019-03-12 14:28:13 +08:00
|
|
|
q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
|
2017-02-02 22:56:50 +08:00
|
|
|
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
|
|
|
|
q->backing_dev_info->name = "block";
|
2011-11-23 17:59:13 +08:00
|
|
|
q->node = node_id;
|
2009-06-12 20:42:56 +08:00
|
|
|
|
2017-08-29 06:03:41 +08:00
|
|
|
timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
|
|
|
|
laptop_mode_timer_fn, 0);
|
|
|
|
timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
|
2019-01-30 21:21:45 +08:00
|
|
|
INIT_WORK(&q->timeout_work, blk_timeout_work);
|
2011-12-14 07:33:41 +08:00
|
|
|
INIT_LIST_HEAD(&q->icq_list);
|
2012-03-06 05:15:18 +08:00
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
2012-03-06 05:15:20 +08:00
|
|
|
INIT_LIST_HEAD(&q->blkg_list);
|
2012-03-06 05:15:18 +08:00
|
|
|
#endif
|
2006-03-19 07:34:37 +08:00
|
|
|
|
2008-01-29 21:51:59 +08:00
|
|
|
kobject_init(&q->kobj, &blk_queue_ktype);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-09-21 03:12:20 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_IO_TRACE
|
|
|
|
mutex_init(&q->blk_trace_mutex);
|
|
|
|
#endif
|
2006-03-19 07:34:37 +08:00
|
|
|
mutex_init(&q->sysfs_lock);
|
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in the sysfs .show/.store
path. Meanwhile, inside block's .show/.store callbacks, q->sysfs_lock is
required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA lock inversion because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too; see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write (store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named .sysfs_lock and
covers synchronous .store, the other is named .sysfs_dir_lock
and covers kobjects and related status changes.
sysfs itself can handle the race between adding/removing kobjects and
showing/storing attributes under kobjects. For switching the scheduler
via storing to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->sysfs_lock);
                               lock(kn->count#202);
                               lock(&q->sysfs_lock);
  lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
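The AB-BA inversion reported above, and why splitting the lock removes it, can be illustrated with a toy lock-order table. This is not how lockdep is implemented: `struct toy_lock_graph`, `toy_graph_init()`, `toy_record_order()` and the `TOY_*` constants are hypothetical, and the check only catches a direct two-lock inversion, not longer cycles.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock identifiers used only by this sketch. */
enum {
	TOY_SYSFS_LOCK = 0,      /* models q->sysfs_lock */
	TOY_KN_COUNT = 1,        /* models the kernfs kn->count lock */
	TOY_SYSFS_DIR_LOCK = 2,  /* models q->sysfs_dir_lock */
	TOY_MAX_LOCKS = 3
};

/* after[a][b] is true once lock b has been taken while lock a was held. */
struct toy_lock_graph {
	bool after[TOY_MAX_LOCKS][TOY_MAX_LOCKS];
};

void toy_graph_init(struct toy_lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Record "lock b acquired while holding lock a".
 * Returns false if the opposite order was recorded earlier, which is
 * the AB-BA pattern; the conflicting edge is then not recorded.
 */
bool toy_record_order(struct toy_lock_graph *g, int a, int b)
{
	if (g->after[b][a])
		return false;
	g->after[a][b] = true;
	return true;
}
```

With this table, the .show path records kn->count followed by sysfs_lock, so the old unregister path (sysfs_lock followed by kn->count) is flagged, while an unregister path that holds sysfs_dir_lock instead is accepted.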
2019-08-27 19:01:48 +08:00
|
|
|
mutex_init(&q->sysfs_dir_lock);
|
2018-11-16 03:17:28 +08:00
|
|
|
spin_lock_init(&q->queue_lock);
|
2011-03-03 08:04:42 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
init_waitqueue_head(&q->mq_freeze_wq);
|
2019-05-21 11:25:55 +08:00
|
|
|
mutex_init(&q->mq_freeze_lock);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
2015-10-22 01:20:12 +08:00
|
|
|
/*
|
|
|
|
* Init percpu_ref in atomic mode so that it's faster to shut down.
|
|
|
|
* See blk_register_queue() for details.
|
|
|
|
*/
|
|
|
|
if (percpu_ref_init(&q->q_usage_counter,
|
|
|
|
blk_queue_usage_counter_release,
|
|
|
|
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
|
2013-10-15 00:11:36 +08:00
|
|
|
goto fail_bdi;
|
2012-03-06 05:15:05 +08:00
|
|
|
|
2015-10-22 01:20:12 +08:00
|
|
|
if (blkcg_init_queue(q))
|
|
|
|
goto fail_ref;
|
|
|
|
|
2020-03-27 16:30:11 +08:00
|
|
|
blk_queue_dma_alignment(q, 511);
|
|
|
|
blk_set_default_limits(&q->limits);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
return q;
|
2011-12-14 07:33:37 +08:00
|
|
|
|
2015-10-22 01:20:12 +08:00
|
|
|
fail_ref:
|
|
|
|
percpu_ref_exit(&q->q_usage_counter);
|
2013-10-15 00:11:36 +08:00
|
|
|
fail_bdi:
|
2017-03-22 07:20:01 +08:00
|
|
|
blk_free_queue_stats(q->stats);
|
|
|
|
fail_stats:
|
2017-02-02 22:56:51 +08:00
|
|
|
bdi_put(q->backing_dev_info);
|
2015-04-24 13:37:18 +08:00
|
|
|
fail_split:
|
2018-05-21 06:25:47 +08:00
|
|
|
bioset_exit(&q->bio_split);
|
2011-12-14 07:33:37 +08:00
|
|
|
fail_id:
|
|
|
|
ida_simple_remove(&blk_queue_ida, q->id);
|
|
|
|
fail_q:
|
|
|
|
kmem_cache_free(blk_requestq_cachep, q);
|
|
|
|
return NULL;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
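The `fail_*` labels above unwind in the reverse order of acquisition, so a failure at any step falls through exactly the cleanups it needs. The following is a minimal user-space sketch of that pattern, not kernel code: `demo_queue`, `demo_alloc`, `demo_free`, the `fail_step` knob, and the `released` counter are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource holder, not a kernel type. */
struct demo_queue {
	int *id;
	int *stats;
	int *ref;
};

/*
 * Acquire three resources in order. fail_step (1..3) forces that step to
 * fail so the unwind path can be exercised; 0 means no forced failure.
 * *released counts how many earlier resources the error path gave back.
 */
static struct demo_queue *demo_alloc(int fail_step, int *released)
{
	struct demo_queue *q;

	assert(released != NULL);
	*released = 0;

	q = malloc(sizeof(*q));
	if (!q)
		return NULL;

	q->id = fail_step == 1 ? NULL : malloc(sizeof(int));
	if (!q->id)
		goto fail_q;

	q->stats = fail_step == 2 ? NULL : malloc(sizeof(int));
	if (!q->stats)
		goto fail_id;

	q->ref = fail_step == 3 ? NULL : malloc(sizeof(int));
	if (!q->ref)
		goto fail_stats;

	return q;

	/* Labels sit in reverse acquisition order, so control falls through. */
fail_stats:
	free(q->stats);
	(*released)++;
fail_id:
	free(q->id);
	(*released)++;
fail_q:
	free(q);
	return NULL;
}

/* Release everything a successful demo_alloc() acquired. */
static void demo_free(struct demo_queue *q)
{
	free(q->ref);
	free(q->stats);
	free(q->id);
	free(q);
}
```

Jumping to a later label skips the cleanups for resources that were never acquired, which is why each `goto` targets the label for the most recently completed step.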
|
2020-03-27 16:30:11 +08:00
|
|
|
|
|
|
|
struct request_queue *blk_alloc_queue(make_request_fn make_request, int node_id)
|
|
|
|
{
|
|
|
|
struct request_queue *q;
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(!make_request))
|
2020-03-30 00:08:26 +08:00
|
|
|
return NULL;
|
2020-03-27 16:30:11 +08:00
|
|
|
|
|
|
|
q = __blk_alloc_queue(node_id);
|
|
|
|
if (!q)
|
|
|
|
return NULL;
|
|
|
|
q->make_request_fn = make_request;
|
|
|
|
q->nr_requests = BLKDEV_MAX_RQ;
|
|
|
|
return q;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_alloc_queue);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2011-12-14 07:33:38 +08:00
|
|
|
bool blk_get_queue(struct request_queue *q)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-11-28 20:42:38 +08:00
|
|
|
if (likely(!blk_queue_dying(q))) {
|
2011-12-14 07:33:38 +08:00
|
|
|
__blk_get_queue(q);
|
|
|
|
return true;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2011-12-14 07:33:38 +08:00
|
|
|
return false;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2011-05-27 13:44:43 +08:00
|
|
|
EXPORT_SYMBOL(blk_get_queue);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-30 00:23:51 +08:00
|
|
|
/**
|
|
|
|
* blk_get_request - allocate a request
|
|
|
|
* @q: request queue to allocate a request for
|
|
|
|
* @op: operation (REQ_OP_*) and REQ_* flags, e.g. REQ_SYNC.
|
|
|
|
* @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2018-10-30 00:23:51 +08:00
|
|
|
struct request *blk_get_request(struct request_queue *q, unsigned int op,
|
|
|
|
blk_mq_req_flags_t flags)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2018-10-30 00:23:51 +08:00
|
|
|
struct request *req;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-10-30 00:23:51 +08:00
|
|
|
WARN_ON_ONCE(op & REQ_NOWAIT);
|
|
|
|
WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-10-30 00:23:51 +08:00
|
|
|
req = blk_mq_alloc_request(q, op, flags);
|
|
|
|
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
|
|
|
|
q->mq_ops->initialize_rq_fn(req);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-10-30 00:23:51 +08:00
|
|
|
return req;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2018-10-30 00:23:51 +08:00
|
|
|
EXPORT_SYMBOL(blk_get_request);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
void blk_put_request(struct request *req)
|
|
|
|
{
|
2018-10-30 00:23:51 +08:00
|
|
|
blk_mq_free_request(req);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_put_request);
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
bool bio_attempt_back_merge(struct request *req, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
2016-08-06 05:35:16 +08:00
|
|
|
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
if (!ll_back_merge_fn(req, bio, nr_segs))
|
2011-03-08 20:19:51 +08:00
|
|
|
return false;
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
trace_block_bio_backmerge(req->q, req, bio);
|
2019-08-29 06:05:54 +08:00
|
|
|
rq_qos_merge(req->q, req, bio);
|
2011-03-08 20:19:51 +08:00
|
|
|
|
|
|
|
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
|
|
|
|
blk_rq_set_mixed_merge(req);
|
|
|
|
|
|
|
|
req->biotail->bi_next = bio;
|
|
|
|
req->biotail = bio;
|
2013-10-12 06:44:27 +08:00
|
|
|
req->__data_len += bio->bi_iter.bi_size;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
blk_account_io_start(req, false);
|
2011-03-08 20:19:51 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
bool bio_attempt_front_merge(struct request *req, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
2016-08-06 05:35:16 +08:00
|
|
|
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
if (!ll_front_merge_fn(req, bio, nr_segs))
|
2011-03-08 20:19:51 +08:00
|
|
|
return false;
|
|
|
|
|
2019-06-06 18:29:01 +08:00
|
|
|
trace_block_bio_frontmerge(req->q, req, bio);
|
2019-08-29 06:05:54 +08:00
|
|
|
rq_qos_merge(req->q, req, bio);
|
2011-03-08 20:19:51 +08:00
|
|
|
|
|
|
|
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
|
|
|
|
blk_rq_set_mixed_merge(req);
|
|
|
|
|
|
|
|
bio->bi_next = req->bio;
|
|
|
|
req->bio = bio;
|
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
req->__sector = bio->bi_iter.bi_sector;
|
|
|
|
req->__data_len += bio->bi_iter.bi_size;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
blk_account_io_start(req, false);
|
2011-03-08 20:19:51 +08:00
|
|
|
return true;
|
|
|
|
}
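The two merge helpers above differ only in which end of the request's bio chain they extend: a back merge links the bio after `biotail`, while a front merge makes it the new head and moves the request's start sector. The sketch below reproduces that list manipulation in user-space C. `demo_bio`, `demo_req`, and `DEMO_SECTOR_SIZE` are hypothetical stand-ins, and the adjacency checks are an assumption added to keep the example self-contained; in the code above, eligibility is decided by `ll_back_merge_fn()` / `ll_front_merge_fn()` and by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SECTOR_SIZE 512u

/* Simplified stand-in for struct bio (hypothetical). */
struct demo_bio {
	unsigned long sector;	/* start sector */
	unsigned int size;	/* length in bytes */
	struct demo_bio *next;
};

/* Simplified stand-in for struct request (hypothetical). */
struct demo_req {
	unsigned long sector;		/* start sector of the whole request */
	unsigned int data_len;		/* total bytes across all bios */
	struct demo_bio *bio;		/* head of the chain */
	struct demo_bio *biotail;	/* tail of the chain */
};

/* Append bio if it starts exactly where the request ends. */
static bool demo_back_merge(struct demo_req *req, struct demo_bio *bio)
{
	assert(req->bio && req->biotail);

	if (req->sector + req->data_len / DEMO_SECTOR_SIZE != bio->sector)
		return false;

	bio->next = NULL;
	req->biotail->next = bio;
	req->biotail = bio;
	req->data_len += bio->size;
	return true;
}

/* Prepend bio if it ends exactly where the request starts. */
static bool demo_front_merge(struct demo_req *req, struct demo_bio *bio)
{
	assert(req->bio && req->biotail);

	if (bio->sector + bio->size / DEMO_SECTOR_SIZE != req->sector)
		return false;

	bio->next = req->bio;
	req->bio = bio;
	req->sector = bio->sector;	/* the request now starts earlier */
	req->data_len += bio->size;
	return true;
}
```

Keeping a tail pointer makes the back merge constant time, which matters because sequential I/O appends far more often than it prepends.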
|
|
|
|
|
2017-02-08 21:46:49 +08:00
|
|
|
bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
|
|
|
|
struct bio *bio)
|
|
|
|
{
|
|
|
|
unsigned short segments = blk_rq_nr_discard_segments(req);
|
|
|
|
|
|
|
|
if (segments >= queue_max_discard_segments(q))
|
|
|
|
goto no_merge;
|
|
|
|
if (blk_rq_sectors(req) + bio_sectors(bio) >
|
|
|
|
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
|
|
|
|
goto no_merge;
|
|
|
|
|
2019-08-29 06:05:54 +08:00
|
|
|
rq_qos_merge(q, req, bio);
|
|
|
|
|
2017-02-08 21:46:49 +08:00
|
|
|
req->biotail->bi_next = bio;
|
|
|
|
req->biotail = bio;
|
|
|
|
req->__data_len += bio->bi_iter.bi_size;
|
|
|
|
req->nr_phys_segments = segments + 1;
|
|
|
|
|
|
|
|
blk_account_io_start(req, false);
|
|
|
|
return true;
|
|
|
|
no_merge:
|
|
|
|
req_set_nomerge(q, req);
|
|
|
|
return false;
|
|
|
|
}
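The discard merge above is gated by two limits, a segment count and a sector count, and a request that hits either limit is marked so that later merge attempts are skipped. A minimal user-space sketch of that gating follows. `demo_discard_req` and `demo_discard_merge` are hypothetical names, and the limits are passed in as plain parameters rather than read from a queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a discard request. */
struct demo_discard_req {
	unsigned short segments;	/* discard ranges already attached */
	unsigned int sectors;		/* total sectors covered so far */
	bool nomerge;			/* set once a limit has been hit */
};

/*
 * Try to add a range of bio_sectors to req. Refuse, and flag the request
 * as unmergeable, if either the segment or the sector limit would be
 * exceeded.
 */
static bool demo_discard_merge(struct demo_discard_req *req,
			       unsigned int bio_sectors,
			       unsigned short max_segments,
			       unsigned int max_sectors)
{
	assert(req != NULL);

	if (req->segments >= max_segments)
		goto no_merge;
	if (req->sectors + bio_sectors > max_sectors)
		goto no_merge;

	req->sectors += bio_sectors;
	req->segments++;
	return true;

no_merge:
	req->nomerge = true;
	return false;
}
```

Flagging the request on the first refusal avoids rechecking the same limits for every later bio that targets it.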
|
|
|
|
|
2011-10-19 20:33:08 +08:00
|
|
|
/**
|
2013-10-24 16:20:05 +08:00
|
|
|
* blk_attempt_plug_merge - try to merge with %current's plugged list
|
2011-10-19 20:33:08 +08:00
|
|
|
* @q: request_queue new bio is being queued at
|
|
|
|
* @bio: new bio being queued
|
2019-06-06 18:29:01 +08:00
|
|
|
* @nr_segs: number of segments in @bio
|
2015-10-31 09:36:16 +08:00
|
|
|
* @same_queue_rq: pointer to &struct request that gets filled in when
|
|
|
|
* another request associated with @q is found on the plug list
|
|
|
|
* (optional, may be %NULL)
|
2011-10-19 20:33:08 +08:00
|
|
|
*
|
|
|
|
* Determine whether @bio being queued on @q can be merged with a request
|
|
|
|
* on %current's plugged list. Returns %true if merge was successful,
|
|
|
|
* otherwise %false.
|
|
|
|
*
|
block: don't call elevator callbacks for plug merges
Plug merge calls two elevator callbacks outside queue lock -
elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although
attempt_plug_merge() suggests that elevator is guaranteed to be there
through the existing request on the plug list, nothing prevents plug
merge from calling into dying or initializing elevator.
For regular merges, bypass ensures elvpriv count to reach zero, which
in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
from forced back insertion. Plug merge doesn't check ELVPRIV, and, as
the requests haven't gone through elevator insertion yet, it doesn't
have SOFTBARRIER set allowing merges on a bypassed queue.
This, for example, leads to the following crash during elevator
switch.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
PGD 112cbc067 PUD 115d5c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: deadline_iosched
Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
Stack:
ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
Call Trace:
[<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
[<ffffffff81391bf1>] elv_try_merge+0x21/0x40
[<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
[<ffffffff81396a5a>] generic_make_request+0xca/0x100
[<ffffffff81396b04>] submit_bio+0x74/0x100
[<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
[<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
[<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
[<ffffffff811986b2>] do_sync_read+0xe2/0x120
[<ffffffff81199345>] vfs_read+0xc5/0x180
[<ffffffff81199501>] sys_read+0x51/0x90
[<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
There are multiple ways to fix this including making plug merge check
ELVPRIV; however,
* Calling into elevator outside queue lock is confusing and
error-prone.
* Requests on plug list aren't known to the elevator. They aren't on
the elevator yet, so there's no elevator specific state to update.
* Given the nature of plug merges - collecting bio's for the same
purpose from the same issuer - elevator specific restrictions aren't
applicable.
So, simply don't call into elevator methods from plug merge by moving
elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
using blk_try_merge() in attempt_plug_merge().
This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.
Note that this makes per-cgroup merged stats skip plug merging.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4F16F3CA.90904@kernel.dk>
Original-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 16:19:42 +08:00
|
|
|
* Plugging coalesces IOs from the same issuer for the same purpose without
|
|
|
|
* going through @q->queue_lock. As such it's more of an issuing mechanism
|
|
|
|
* than scheduling, and the request, while it may have elvpriv data, is not
|
|
|
|
* added to the elevator at this point. In addition, we don't have
|
|
|
|
* reliable access to the elevator outside queue lock. Only check basic
|
|
|
|
* merging parameters without querying the elevator.
|
2014-05-21 05:46:26 +08:00
|
|
|
*
|
|
|
|
* Caller must ensure !blk_queue_nomerges(q) beforehand.
|
2011-03-08 20:19:51 +08:00
|
|
|
*/
|
2013-10-24 16:20:05 +08:00
|
|
|
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
|
2019-06-06 18:29:01 +08:00
|
|
|
unsigned int nr_segs, struct request **same_queue_rq)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
|
|
|
struct blk_plug *plug;
|
|
|
|
struct request *rq;
|
2013-10-30 02:01:03 +08:00
|
|
|
struct list_head *plug_list;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
block: Disable write plugging for zoned block devices
Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contexts to be
reordered if plugging is done outside of the mutual exclusion section,
e.g. the plug was started by a function higher in the call chain than
the function issuing BIOs.
Context A Context B
| blk_start_plug()
| ...
| seq_write_zone()
| mutex_lock(zone)
| bio-0->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-0)
| submit_bio(bio-0)
| bio-1->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-1)
| submit_bio(bio-1)
| mutex_unlock(zone)
| return
| -----------------------> | seq_write_zone()
| mutex_lock(zone)
| bio-2->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-2)
| submit_bio(bio-2)
| mutex_unlock(zone)
| <------------------------- |
| blk_finish_plug()
In the above example, despite the mutex synchronization ensuring the
correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
issued after BIO 2 of context B, when the plug is released with
blk_finish_plug().
While this problem can be addressed using the blk_flush_plug_list()
function (in the above example, the call must be inserted before the
zone mutex lock is released), a simple generic solution in the block
layer avoids this additional code in all zoned block device user code.
The simple generic solution implemented with this patch is to introduce
the internal helper function blk_mq_plug() to access the current
context plug on BIO submission. This helper returns the current plug
only if the target device is not a zoned block device or if the BIO to
be plugged is not a write operation. Otherwise, the caller context plug
is ignored and NULL returned, resulting in all writes to zoned block
devices never being plugged.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
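The rule the helper applies can be sketched in a few lines of userspace C. This is only an illustration of the decision described above, not the kernel's implementation: `zp_plug`, `zp_queue`, `zp_bio` and `zp_mq_plug()` are hypothetical names, and the caller's plug is passed in explicitly instead of being read from the current task.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the kernel types carry far more state. */
struct zp_plug  { int pending; };
struct zp_queue { bool zoned; };
struct zp_bio   { bool is_write; };

/*
 * Return the caller's plug unless the bio is a write aimed at a zoned
 * device, in which case plugging is bypassed by returning NULL.
 */
static struct zp_plug *zp_mq_plug(const struct zp_queue *q,
				  const struct zp_bio *bio,
				  struct zp_plug *current_plug)
{
	if (!q->zoned || !bio->is_write)
		return current_plug;
	return NULL;
}
```

A caller that receives NULL issues the bio directly, so writes to a sequential zone keep the order established under the zone mutex.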
2019-07-11 00:18:31 +08:00
|
|
|
plug = blk_mq_plug(q, bio);
|
2011-03-08 20:19:51 +08:00
|
|
|
if (!plug)
|
2017-02-08 21:46:48 +08:00
|
|
|
return false;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-30 00:23:51 +08:00
|
|
|
plug_list = &plug->mq_list;
|
2013-10-30 02:01:03 +08:00
|
|
|
|
|
|
|
list_for_each_entry_reverse(rq, plug_list, queuelist) {
|
2017-02-08 21:46:48 +08:00
|
|
|
bool merged = false;
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2018-11-24 13:04:33 +08:00
|
|
|
if (rq->q == q && same_queue_rq) {
|
2015-05-09 01:51:33 +08:00
|
|
|
/*
|
|
|
|
* Only blk-mq multiple hardware queues case checks the
|
|
|
|
* rq in the same queue, there should be only one such
|
|
|
|
* rq in a queue
|
|
|
|
*/
|
2018-11-24 13:04:33 +08:00
|
|
|
*same_queue_rq = rq;
|
2015-05-09 01:51:33 +08:00
|
|
|
}
|
2011-08-24 22:04:34 +08:00
|
|
|
|
2012-02-08 16:19:42 +08:00
|
|
|
if (rq->q != q || !blk_rq_merge_ok(rq, bio))
|
2011-03-08 20:19:51 +08:00
|
|
|
continue;
|
|
|
|
|
2017-02-08 21:46:48 +08:00
|
|
|
switch (blk_try_merge(rq, bio)) {
|
|
|
|
case ELEVATOR_BACK_MERGE:
|
2019-06-06 18:29:01 +08:00
|
|
|
merged = bio_attempt_back_merge(rq, bio, nr_segs);
|
2017-02-08 21:46:48 +08:00
|
|
|
break;
|
|
|
|
case ELEVATOR_FRONT_MERGE:
|
2019-06-06 18:29:01 +08:00
|
|
|
merged = bio_attempt_front_merge(rq, bio, nr_segs);
|
2017-02-08 21:46:48 +08:00
|
|
|
break;
|
2017-02-08 21:46:49 +08:00
|
|
|
case ELEVATOR_DISCARD_MERGE:
|
|
|
|
merged = bio_attempt_discard_merge(q, rq, bio);
|
|
|
|
break;
|
2017-02-08 21:46:48 +08:00
|
|
|
default:
|
|
|
|
break;
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
2017-02-08 21:46:48 +08:00
|
|
|
|
|
|
|
if (merged)
|
|
|
|
return true;
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
2017-02-08 21:46:48 +08:00
|
|
|
|
|
|
|
return false;
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
|
|
|
|
2018-03-14 23:56:53 +08:00
|
|
|
static void handle_bad_sector(struct bio *bio, sector_t maxsector)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
|
|
|
|
printk(KERN_INFO "attempt to access beyond end of device\n");
|
2016-06-06 03:32:21 +08:00
|
|
|
printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
|
2017-08-24 01:10:32 +08:00
|
|
|
bio_devname(bio, b), bio->bi_opf,
|
2012-09-26 06:05:12 +08:00
|
|
|
(unsigned long long)bio_end_sector(bio),
|
2018-03-14 23:56:53 +08:00
|
|
|
(long long)maxsector);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2006-12-08 18:39:46 +08:00
|
|
|
#ifdef CONFIG_FAIL_MAKE_REQUEST
|
|
|
|
|
|
|
|
static DECLARE_FAULT_ATTR(fail_make_request);
|
|
|
|
|
|
|
|
static int __init setup_fail_make_request(char *str)
|
|
|
|
{
|
|
|
|
return setup_fault_attr(&fail_make_request, str);
|
|
|
|
}
|
|
|
|
__setup("fail_make_request=", setup_fail_make_request);
|
|
|
|
|
2011-07-27 07:09:03 +08:00
|
|
|
static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
|
2006-12-08 18:39:46 +08:00
|
|
|
{
|
2011-07-27 07:09:03 +08:00
|
|
|
return part->make_it_fail && should_fail(&fail_make_request, bytes);
|
2006-12-08 18:39:46 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int __init fail_make_request_debugfs(void)
|
|
|
|
{
|
2011-08-04 07:21:01 +08:00
|
|
|
struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
|
|
|
|
NULL, &fail_make_request);
|
|
|
|
|
2014-04-11 15:58:56 +08:00
|
|
|
return PTR_ERR_OR_ZERO(dir);
|
2006-12-08 18:39:46 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
late_initcall(fail_make_request_debugfs);
|
|
|
|
|
|
|
|
#else /* CONFIG_FAIL_MAKE_REQUEST */
|
|
|
|
|
2011-07-27 07:09:03 +08:00
|
|
|
static inline bool should_fail_request(struct hd_struct *part,
|
|
|
|
unsigned int bytes)
|
2006-12-08 18:39:46 +08:00
|
|
|
{
|
2011-07-27 07:09:03 +08:00
|
|
|
return false;
|
2006-12-08 18:39:46 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_FAIL_MAKE_REQUEST */
|
|
|
|
|
2018-01-11 21:09:11 +08:00
|
|
|
static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part)
|
|
|
|
{
|
2018-08-15 00:52:40 +08:00
|
|
|
const int op = bio_op(bio);
|
|
|
|
|
2018-09-06 06:14:36 +08:00
|
|
|
if (part->policy && op_is_write(op)) {
|
2018-01-11 21:09:11 +08:00
|
|
|
char b[BDEVNAME_SIZE];
|
|
|
|
|
2018-09-06 06:14:36 +08:00
|
|
|
if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
|
|
|
|
return false;
|
|
|
|
|
Partially revert "block: fail op_is_write() requests to read-only partitions"
It turns out that commit 721c7fc701c7 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.
The reason is that the lvm snapshotting will continue to write to the
snapshot COW volume, even after the volume has been marked read-only.
End result: snapshot failure.
This has actually been fixed in newer versions of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=200439
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442
The lvm2 fix is here
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3
but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check. It now allows the
write to happen, but causes a warning, something like this:
generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
Workqueue: ksnaphd do_metadata
RIP: 0010:generic_make_request_checks+0x4ac/0x600
...
Call Trace:
generic_make_request+0x64/0x400
submit_bio+0x6c/0x140
dispatch_io+0x287/0x430
sync_io+0xc3/0x120
dm_io+0x1f8/0x220
do_metadata+0x1d/0x30
process_one_work+0x1b9/0x3e0
worker_thread+0x2b/0x3c0
kthread+0x113/0x130
ret_from_fork+0x35/0x40
Note that this is a "revert" in behavior only. I'm leaving alone the
actual code cleanups in commit 721c7fc701c7, but letting the previously
uncaught request go through with a warning instead of stopping it.
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-04 03:22:09 +08:00
|
|
|
WARN_ONCE(1,
|
2018-01-11 21:09:11 +08:00
|
|
|
"generic_make_request: Trying to write "
|
|
|
|
"to read-only block-device %s (partno %d)\n",
|
|
|
|
bio_devname(bio, b), part->partno);
|
2018-08-04 03:22:09 +08:00
|
|
|
/* Older lvm-tools actually trigger this */
|
|
|
|
return false;
|
2018-01-11 21:09:11 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2018-02-07 06:05:39 +08:00
|
|
|
static noinline int should_fail_bio(struct bio *bio)
|
|
|
|
{
|
|
|
|
if (should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size))
|
|
|
|
return -EIO;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
|
|
|
|
|
2018-03-14 23:56:53 +08:00
|
|
|
/*
|
|
|
|
* Check whether this bio extends beyond the end of the device or partition.
|
|
|
|
* This may well happen - the kernel calls bread() without checking the size of
|
|
|
|
* the device, e.g., when mounting a file system.
|
|
|
|
*/
|
|
|
|
static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
|
|
|
|
{
|
|
|
|
unsigned int nr_sectors = bio_sectors(bio);
|
|
|
|
|
|
|
|
if (nr_sectors && maxsector &&
|
|
|
|
(nr_sectors > maxsector ||
|
|
|
|
bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
|
|
|
|
handle_bad_sector(bio, maxsector);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-08-24 01:10:32 +08:00
|
|
|
/*
|
|
|
|
* Remap block n of partition p to block n+start(p) of the disk.
|
|
|
|
*/
|
|
|
|
static inline int blk_partition_remap(struct bio *bio)
|
|
|
|
{
|
|
|
|
struct hd_struct *p;
|
2018-03-14 23:56:53 +08:00
|
|
|
int ret = -EIO;
|
2017-08-24 01:10:32 +08:00
|
|
|
|
2018-01-11 21:09:11 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
p = __disk_get_part(bio->bi_disk, bio->bi_partno);
|
2018-03-14 23:56:53 +08:00
|
|
|
if (unlikely(!p))
|
|
|
|
goto out;
|
|
|
|
if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
|
|
|
|
goto out;
|
|
|
|
if (unlikely(bio_check_ro(bio, p)))
|
2018-01-11 21:09:11 +08:00
|
|
|
goto out;
|
|
|
|
|
2019-11-11 10:39:25 +08:00
|
|
|
if (bio_sectors(bio)) {
|
2018-03-14 23:56:53 +08:00
|
|
|
if (bio_check_eod(bio, part_nr_sects_read(p)))
|
|
|
|
goto out;
|
|
|
|
bio->bi_iter.bi_sector += p->start_sect;
|
|
|
|
trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
|
|
|
|
bio->bi_iter.bi_sector - p->start_sect);
|
|
|
|
}
|
2018-06-07 16:29:44 +08:00
|
|
|
bio->bi_partno = 0;
|
2018-03-14 23:56:53 +08:00
|
|
|
ret = 0;
|
2018-01-11 21:09:11 +08:00
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
2017-08-24 01:10:32 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-09-15 20:01:40 +08:00
|
|
|
static noinline_for_stack bool
|
|
|
|
generic_make_request_checks(struct bio *bio)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-07-24 15:28:11 +08:00
|
|
|
struct request_queue *q;
|
2011-09-12 18:12:01 +08:00
|
|
|
int nr_sectors = bio_sectors(bio);
|
2017-06-03 15:38:06 +08:00
|
|
|
blk_status_t status = BLK_STS_IOERR;
|
2011-09-12 18:12:01 +08:00
|
|
|
char b[BDEVNAME_SIZE];
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
might_sleep();
|
|
|
|
|
2017-08-24 01:10:32 +08:00
|
|
|
q = bio->bi_disk->queue;
|
2011-09-12 18:12:01 +08:00
|
|
|
if (unlikely(!q)) {
|
|
|
|
printk(KERN_ERR
|
|
|
|
"generic_make_request: Trying to access "
|
|
|
|
"nonexistent block-device %s (%Lu)\n",
|
2017-08-24 01:10:32 +08:00
|
|
|
bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
|
2011-09-12 18:12:01 +08:00
|
|
|
goto end_io;
|
|
|
|
}
|
2006-12-08 18:39:46 +08:00
|
|
|
|
2017-06-20 20:05:46 +08:00
|
|
|
/*
|
2020-05-29 03:19:29 +08:00
|
|
|
* For a REQ_NOWAIT based request, return -EOPNOTSUPP
|
|
|
|
* if queue is not a request based queue.
|
2017-06-20 20:05:46 +08:00
|
|
|
*/
|
2020-05-29 03:19:29 +08:00
|
|
|
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q))
|
|
|
|
goto not_supported;
|
2017-06-20 20:05:46 +08:00
|
|
|
|
2018-02-07 06:05:39 +08:00
|
|
|
if (should_fail_bio(bio))
|
2011-09-12 18:12:01 +08:00
|
|
|
goto end_io;
|
2006-03-24 03:00:26 +08:00
|
|
|
|
2018-03-14 23:56:53 +08:00
|
|
|
if (bio->bi_partno) {
|
|
|
|
if (unlikely(blk_partition_remap(bio)))
|
2018-01-11 21:09:11 +08:00
|
|
|
goto end_io;
|
|
|
|
} else {
|
2018-03-14 23:56:53 +08:00
|
|
|
if (unlikely(bio_check_ro(bio, &bio->bi_disk->part0)))
|
|
|
|
goto end_io;
|
|
|
|
if (unlikely(bio_check_eod(bio, get_capacity(bio->bi_disk))))
|
2018-01-11 21:09:11 +08:00
|
|
|
goto end_io;
|
|
|
|
}
|
2006-03-24 03:00:26 +08:00
|
|
|
|
2011-09-12 18:12:01 +08:00
|
|
|
/*
|
|
|
|
* Filter flush bio's early so that make_request based
|
|
|
|
* drivers without flush support don't have to worry
|
|
|
|
* about them.
|
|
|
|
*/
|
2017-01-28 00:08:23 +08:00
|
|
|
if (op_is_flush(bio->bi_opf) &&
|
2016-04-14 03:33:19 +08:00
|
|
|
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
|
2016-08-06 05:35:16 +08:00
|
|
|
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
|
2011-09-12 18:12:01 +08:00
|
|
|
if (!nr_sectors) {
|
2017-06-03 15:38:06 +08:00
|
|
|
status = BLK_STS_OK;
|
2007-11-02 15:49:08 +08:00
|
|
|
goto end_io;
|
|
|
|
}
|
2011-09-12 18:12:01 +08:00
|
|
|
}
|
2006-10-31 14:07:21 +08:00
|
|
|
|
2018-12-15 00:21:22 +08:00
|
|
|
if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
|
|
|
|
bio->bi_opf &= ~REQ_HIPRI;
|
|
|
|
|
2016-06-09 22:00:36 +08:00
|
|
|
switch (bio_op(bio)) {
|
|
|
|
case REQ_OP_DISCARD:
|
|
|
|
if (!blk_queue_discard(q))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
|
|
|
case REQ_OP_SECURE_ERASE:
|
|
|
|
if (!blk_queue_secure_erase(q))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
|
|
|
case REQ_OP_WRITE_SAME:
|
2017-08-24 01:10:32 +08:00
|
|
|
if (!q->limits.max_write_same_sectors)
|
2016-06-09 22:00:36 +08:00
|
|
|
goto not_supported;
|
2016-12-04 21:56:39 +08:00
|
|
|
break;
|
2016-10-18 14:40:32 +08:00
|
|
|
case REQ_OP_ZONE_RESET:
|
2019-10-27 22:05:45 +08:00
|
|
|
case REQ_OP_ZONE_OPEN:
|
|
|
|
case REQ_OP_ZONE_CLOSE:
|
|
|
|
case REQ_OP_ZONE_FINISH:
|
2017-08-24 01:10:32 +08:00
|
|
|
if (!blk_queue_is_zoned(q))
|
2016-10-18 14:40:32 +08:00
|
|
|
goto not_supported;
|
2016-06-09 22:00:36 +08:00
|
|
|
break;
|
2019-08-02 01:26:36 +08:00
|
|
|
case REQ_OP_ZONE_RESET_ALL:
|
|
|
|
if (!blk_queue_is_zoned(q) || !blk_queue_zone_resetall(q))
|
|
|
|
goto not_supported;
|
|
|
|
break;
|
2016-12-01 04:28:59 +08:00
|
|
|
case REQ_OP_WRITE_ZEROES:
|
2017-08-24 01:10:32 +08:00
|
|
|
if (!q->limits.max_write_zeroes_sectors)
|
2016-12-01 04:28:59 +08:00
|
|
|
goto not_supported;
|
|
|
|
break;
|
2016-06-09 22:00:36 +08:00
|
|
|
default:
|
|
|
|
break;
|
2011-09-12 18:12:01 +08:00
|
|
|
}
|
2009-09-09 03:56:38 +08:00
|
|
|
|
2012-06-05 11:40:56 +08:00
|
|
|
/*
|
|
|
|
* Various block parts want %current->io_context and lazy ioc
|
|
|
|
* allocation ends up trading a lot of pain for a small amount of
|
|
|
|
* memory. Just allocate it upfront. This may fail and block
|
|
|
|
* layer knows how to live with it.
|
|
|
|
*/
|
|
|
|
create_io_context(GFP_ATOMIC, q->node);
|
|
|
|
|
2015-08-19 05:55:20 +08:00
|
|
|
if (!blkcg_bio_issue_check(q, bio))
|
|
|
|
return false;
|
2011-09-15 20:01:40 +08:00
|
|
|
|
block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either, even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
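The flag protocol described in this commit message can be modelled in userspace C. This is a rough sketch under stated assumptions: `tc_bio`, `tc_submit()`, `tc_split()` and `tc_endio()` are hypothetical names, a boolean field stands in for BIO_TRACE_COMPLETION, and two counters stand in for the 'queue' and 'complete' trace events.

```c
#include <stdbool.h>

/* Hypothetical model of a bio: only the fields this sketch needs. */
struct tc_bio {
	unsigned int sectors;
	bool trace_completion;	/* models BIO_TRACE_COMPLETION */
};

static int tc_queue_events;
static int tc_complete_events;

/* Submission: emit one 'queue' event unless one was already traced. */
static void tc_submit(struct tc_bio *bio)
{
	if (!bio->trace_completion) {
		tc_queue_events++;
		bio->trace_completion = true;
	}
}

/* Split 'sectors' off the front; the new bio inherits the flag. */
static struct tc_bio tc_split(struct tc_bio *bio, unsigned int sectors)
{
	struct tc_bio front = { sectors, bio->trace_completion };

	bio->sectors -= sectors;
	return front;
}

/* Completion: emit a 'complete' event only if the flag is set, then clear it. */
static void tc_endio(struct tc_bio *bio)
{
	if (bio->trace_completion) {
		tc_complete_events++;
		bio->trace_completion = false;
	}
}
```

Submitting a bio, splitting it once, resubmitting the remainder and completing both halves yields one 'queue' event and two 'complete' events in this model; a second completion call on the same bio adds nothing because the flag has already been cleared.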
2017-04-07 23:40:52 +08:00
|
|
|
if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
|
|
|
|
trace_block_bio_queue(q, bio);
|
|
|
|
/* Now that enqueuing has been traced, we need to trace
|
|
|
|
* completion as well.
|
|
|
|
*/
|
|
|
|
bio_set_flag(bio, BIO_TRACE_COMPLETION);
|
|
|
|
}
|
2011-09-15 20:01:40 +08:00
|
|
|
return true;
|
2008-11-28 12:32:03 +08:00
|
|
|
|
2016-06-09 22:00:36 +08:00
|
|
|
not_supported:
|
2017-06-03 15:38:06 +08:00
|
|
|
status = BLK_STS_NOTSUPP;
|
2008-11-28 12:32:03 +08:00
|
|
|
end_io:
|
2017-06-03 15:38:06 +08:00
|
|
|
bio->bi_status = status;
|
2015-07-20 21:29:37 +08:00
|
|
|
bio_endio(bio);
|
2011-09-15 20:01:40 +08:00
|
|
|
return false;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2011-09-15 20:01:40 +08:00
|
|
|
/**
|
|
|
|
* generic_make_request - hand a buffer to its device driver for I/O
|
|
|
|
* @bio: The bio describing the location in memory and on the device.
|
|
|
|
*
|
|
|
|
* generic_make_request() is used to make I/O requests of block
|
|
|
|
* devices. It is passed a &struct bio, which describes the I/O that needs
|
|
|
|
* to be done.
|
|
|
|
*
|
|
|
|
* generic_make_request() does not return any status. The
|
|
|
|
* success/failure status of the request, along with notification of
|
|
|
|
* completion, is delivered asynchronously through the bio->bi_end_io
|
|
|
|
* function described (one day) elsewhere.
|
|
|
|
*
|
|
|
|
* The caller of generic_make_request must make sure that bi_io_vec
|
|
|
|
* are set to describe the memory buffer, and that bi_dev and bi_sector are
|
|
|
|
* set to describe the device address, and the
|
|
|
|
* bi_end_io and optionally bi_private are set to describe how
|
|
|
|
* completion notification should be signaled.
|
|
|
|
*
|
|
|
|
* generic_make_request and the drivers it calls may use bi_next if this
|
|
|
|
* bio happens to be merged with someone else, and may resubmit the bio to
|
|
|
|
* a lower device by calling into generic_make_request recursively, which
|
|
|
|
* means the bio should NOT be touched after the call to ->make_request_fn.
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
md used to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 15:53:42 +08:00
|
|
|
*/
|
2015-11-06 01:41:16 +08:00
|
|
|
blk_qc_t generic_make_request(struct bio *bio)
|
2007-05-01 15:53:42 +08:00
|
|
|
{
|
2017-03-10 14:00:47 +08:00
|
|
|
/*
|
|
|
|
* bio_list_on_stack[0] contains bios submitted by the current
|
|
|
|
* make_request_fn.
|
|
|
|
* bio_list_on_stack[1] contains bios that were submitted before
|
|
|
|
* the current make_request_fn, but that haven't been processed
|
|
|
|
* yet.
|
|
|
|
*/
|
|
|
|
struct bio_list bio_list_on_stack[2];
|
2015-11-06 01:41:16 +08:00
|
|
|
blk_qc_t ret = BLK_QC_T_NONE;
|
2010-02-23 15:55:42 +08:00
|
|
|
|
2011-09-15 20:01:40 +08:00
|
|
|
if (!generic_make_request_checks(bio))
|
2015-11-06 01:41:16 +08:00
|
|
|
goto out;
|
2011-09-15 20:01:40 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We only want one ->make_request_fn to be active at a time, else
|
|
|
|
* stack usage with stacked devices could be a problem. So use
|
|
|
|
* current->bio_list to keep a list of requests submitted by a
|
|
|
|
* make_request_fn function. current->bio_list is also used as a
|
|
|
|
* flag to say if generic_make_request is currently active in this
|
|
|
|
* task or not. If it is NULL, then no make_request is active. If
|
|
|
|
* it is non-NULL, then a make_request is active, and new requests
|
|
|
|
* should be added at the tail
|
|
|
|
*/
|
2010-02-23 15:55:42 +08:00
|
|
|
if (current->bio_list) {
|
2017-03-10 14:00:47 +08:00
|
|
|
bio_list_add(&current->bio_list[0], bio);
|
2015-11-06 01:41:16 +08:00
|
|
|
goto out;
|
2007-05-01 15:53:42 +08:00
|
|
|
}
|
2011-09-15 20:01:40 +08:00
|
|
|
|
2007-05-01 15:53:42 +08:00
|
|
|
/* following loop may be a bit non-obvious, and so deserves some
|
|
|
|
* explanation.
|
|
|
|
* Before entering the loop, bio->bi_next is NULL (as all callers
|
|
|
|
* ensure that) so we have a list with a single bio.
|
|
|
|
* We pretend that we have just taken it off a longer list, so
|
2010-02-23 15:55:42 +08:00
|
|
|
* we assign bio_list to a pointer to the bio_list_on_stack,
|
|
|
|
* thus initialising the bio_list of new bios to be
|
2011-09-15 20:01:40 +08:00
|
|
|
* added. ->make_request() may indeed add some more bios
|
2007-05-01 15:53:42 +08:00
|
|
|
* through a recursive call to generic_make_request. If it
|
|
|
|
* did, we find a non-NULL value in bio_list and re-enter the loop
|
|
|
|
* from the top. In this case we really did just take the bio
|
2010-02-23 15:55:42 +08:00
|
|
|
* off the top of the list (no pretending) and so remove it from
|
2011-09-15 20:01:40 +08:00
|
|
|
* bio_list, and call into ->make_request() again.
|
2007-05-01 15:53:42 +08:00
|
|
|
*/
|
|
|
|
BUG_ON(bio->bi_next);
|
2017-03-10 14:00:47 +08:00
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
|
|
|
current->bio_list = bio_list_on_stack;
|
2007-05-01 15:53:42 +08:00
|
|
|
do {
|
2019-05-15 11:03:09 +08:00
|
|
|
struct request_queue *q = bio->bi_disk->queue;
|
|
|
|
blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
|
|
|
|
BLK_MQ_REQ_NOWAIT : 0;
|
2011-09-15 20:01:40 +08:00
|
|
|
|
2019-05-15 11:03:09 +08:00
|
|
|
if (likely(blk_queue_enter(q, flags) == 0)) {
|
blk: improve order of bio handling in generic_make_request()
To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.
If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.
This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.
These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().
An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.
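
A small sketch of the requeueing step just described, under simplifying assumptions: toy_bio, toy_list and toy_requeue are hypothetical names, a device is identified by a plain integer, and the list helpers only mimic the general shape of the kernel's bio_list helpers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio {
	struct toy_bio *next;
	int dev;   /* device this bio is aimed at */
	int id;
};

struct toy_list {
	struct toy_bio *head, *tail;
};

static void toy_list_init(struct toy_list *l)
{
	l->head = l->tail = NULL;
}

static void toy_list_add(struct toy_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct toy_bio *toy_list_pop(struct toy_list *l)
{
	struct toy_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Append src to the end of dst and leave src empty. */
static void toy_list_merge(struct toy_list *dst, struct toy_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	toy_list_init(src);
}

/*
 * fresh: bios submitted by the make_request_fn that just returned.
 * older: bios that were already queued before it was called.
 * On return, fresh holds: other-device bios, then same-device bios,
 * then everything from older. Relative order inside each group is kept.
 */
static void toy_requeue(struct toy_list *fresh, struct toy_list *older,
			int cur_dev)
{
	struct toy_list lower, same;
	struct toy_bio *b;

	assert(fresh != older);
	toy_list_init(&lower);
	toy_list_init(&same);
	while ((b = toy_list_pop(fresh)) != NULL)
		toy_list_add(b->dev == cur_dev ? &same : &lower, b);

	toy_list_merge(fresh, &lower);
	toy_list_merge(fresh, &same);
	toy_list_merge(fresh, older);
}
```

The two-pass split is what distinguishes this from a plain LIFO stack: bios for the lower device jump ahead, but siblings never change places with each other.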
This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.
To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in a single call to a make_request_fn.
A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
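
The split-and-return pattern can be tried out in a toy userspace model. Everything below is an assumption made for illustration: the toy_* names, the two integer device ids TOP and LOWER, the CHUNK size, and the static pool that replaces a mempool. It is a sketch of the ordering only, not of a real driver.

```c
#include <assert.h>
#include <stddef.h>

enum { LOWER = 0, TOP = 1, CHUNK = 4, MAX_BIOS = 16, MAX_TRACE = 16 };

struct toy_bio { struct toy_bio *next; int dev, start, len; };
struct toy_list { struct toy_bio *head, *tail; };

static struct toy_list *cur_list;      /* stands in for current->bio_list */
static struct toy_bio pool[MAX_BIOS];  /* static pool: allocation never waits */
static int pool_used;
static struct { int dev, start, len; } trace[MAX_TRACE];
static int trace_len;

static void toy_list_init(struct toy_list *l) { l->head = l->tail = NULL; }

static void toy_list_add(struct toy_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct toy_bio *toy_list_pop(struct toy_list *l)
{
	struct toy_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Append src to the end of dst and leave src empty. */
static void toy_list_merge(struct toy_list *dst, struct toy_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	toy_list_init(src);
}

static void toy_submit(struct toy_bio *bio);

static struct toy_bio *toy_alloc(int dev, int start, int len)
{
	struct toy_bio *b;

	assert(pool_used < MAX_BIOS);
	b = &pool[pool_used++];
	b->next = NULL;
	b->dev = dev;
	b->start = start;
	b->len = len;
	return b;
}

/* Hypothetical driver: split once, queue the rest, handle the front. */
static void toy_make_request(struct toy_bio *bio)
{
	assert(trace_len < MAX_TRACE);
	trace[trace_len].dev = bio->dev;
	trace[trace_len].start = bio->start;
	trace[trace_len].len = bio->len;
	trace_len++;
	if (bio->dev == LOWER)
		return;  /* bottom device: nothing further to submit */
	if (bio->len > CHUNK) {
		/* Queue the second part to this same device, do not loop. */
		toy_submit(toy_alloc(TOP, bio->start + CHUNK, bio->len - CHUNK));
		bio->len = CHUNK;
	}
	toy_submit(toy_alloc(LOWER, bio->start, bio->len));
}

static void toy_submit(struct toy_bio *bio)
{
	struct toy_list on_stack[2], lower, same;
	struct toy_bio *b;

	if (cur_list) {  /* called from inside a make_request: defer */
		toy_list_add(&cur_list[0], bio);
		return;
	}
	toy_list_init(&on_stack[0]);
	cur_list = on_stack;
	do {
		int dev = bio->dev;

		on_stack[1] = on_stack[0];  /* older bios wait here */
		toy_list_init(&on_stack[0]);
		toy_make_request(bio);
		toy_list_init(&lower);
		toy_list_init(&same);
		while ((b = toy_list_pop(&on_stack[0])) != NULL)
			toy_list_add(b->dev == dev ? &same : &lower, b);
		toy_list_merge(&on_stack[0], &lower);
		toy_list_merge(&on_stack[0], &same);
		toy_list_merge(&on_stack[0], &on_stack[1]);
		bio = toy_list_pop(&on_stack[0]);
	} while (bio);
	cur_list = NULL;
}
```

Submitting a 10-sector bio to TOP in this model records the visits as TOP, LOWER, TOP, LOWER, TOP, LOWER: each split-off remainder is visited only after the lower-device bio for the part in front of it.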
With this in place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.
Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 04:38:05 +08:00
|
|
|
struct bio_list lower, same;
|
|
|
|
|
|
|
|
/* Create a fresh bio_list for all subordinate requests */
|
2017-03-10 14:00:47 +08:00
|
|
|
bio_list_on_stack[1] = bio_list_on_stack[0];
|
|
|
|
bio_list_init(&bio_list_on_stack[0]);
|
2015-11-06 01:41:16 +08:00
|
|
|
ret = q->make_request_fn(q, bio);
|
2015-10-22 01:20:12 +08:00
|
|
|
|
2019-05-15 11:03:09 +08:00
|
|
|
blk_queue_exit(q);
|
|
|
|
|
2017-03-08 04:38:05 +08:00
|
|
|
/* sort new bios into those for a lower level
|
|
|
|
* and those for the same level
|
|
|
|
*/
|
|
|
|
bio_list_init(&lower);
|
|
|
|
bio_list_init(&same);
|
2017-03-10 14:00:47 +08:00
|
|
|
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
|
2017-08-24 01:10:32 +08:00
|
|
|
if (q == bio->bi_disk->queue)
|
2017-03-08 04:38:05 +08:00
|
|
|
bio_list_add(&same, bio);
|
|
|
|
else
|
|
|
|
bio_list_add(&lower, bio);
|
|
|
|
/* now assemble so we handle the lowest level first */
|
2017-03-10 14:00:47 +08:00
|
|
|
bio_list_merge(&bio_list_on_stack[0], &lower);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &same);
|
|
|
|
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
|
2015-10-22 01:20:12 +08:00
|
|
|
} else {
|
2017-06-20 20:05:46 +08:00
|
|
|
if (unlikely(!blk_queue_dying(q) &&
|
|
|
|
(bio->bi_opf & REQ_NOWAIT)))
|
|
|
|
bio_wouldblock_error(bio);
|
|
|
|
else
|
|
|
|
bio_io_error(bio);
|
2015-10-22 01:20:12 +08:00
|
|
|
}
|
2017-03-10 14:00:47 +08:00
|
|
|
bio = bio_list_pop(&bio_list_on_stack[0]);
|
When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
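The per-thread queueing described above can be sketched in user space roughly as follows. This is a minimal illustration, not the kernel code: `toy_bio`, `toy_list`, `active_list`, `toy_submit` and `toy_make_request` are hypothetical stand-ins for struct bio, the per-thread bio list, and generic_make_request / make_request_fn, and a plain global replaces the per-thread pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio: only what the sketch needs. */
struct toy_bio {
	struct toy_bio *next;	/* plays the role of bi_next */
	int level;		/* depth in the device stack; 0 is the bottom */
	int id;
};

/* Hypothetical stand-in for the per-thread list of pending bios. */
struct toy_list {
	struct toy_bio *head, *tail;
};

static struct toy_list *active_list;	/* NULL when no submission is active */
static int cur_depth, max_depth;	/* nesting of toy_make_request calls */
static int order[16], n_done;		/* ids in the order they reached level 0 */

static void toy_list_add(struct toy_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct toy_bio *toy_list_pop(struct toy_list *l)
{
	struct toy_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;	/* clear the link before the real call */
	}
	return b;
}

static void toy_submit(struct toy_bio *bio);

/* A stacked "driver": remaps the bio one level down and resubmits it. */
static void toy_make_request(struct toy_bio *bio)
{
	cur_depth++;
	if (cur_depth > max_depth)
		max_depth = cur_depth;
	if (bio->level > 0) {
		bio->level--;
		toy_submit(bio);	/* queued, not recursed into */
	} else {
		assert(n_done < 16);
		order[n_done++] = bio->id;
	}
	cur_depth--;
}

static void toy_submit(struct toy_bio *bio)
{
	struct toy_list pending = { NULL, NULL };

	if (active_list) {
		/* A submission is already active: just link the bio in. */
		toy_list_add(active_list, bio);
		return;
	}
	active_list = &pending;
	do {
		toy_make_request(bio);
		bio = toy_list_pop(&pending);
	} while (bio);
	active_list = NULL;	/* deactivate */
}
```

Submitting a bio that has to pass through three stacked levels never nests the driver call more than one deep in this sketch, which is the stack-usage property the message is after.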
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call its bi_end_io function.
  md used to sometimes do this (marking the superblock dirty before
  completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
  change, such as bi_sector or bi_bdev. It is hard to see a good
  reason for this, and I don't think anyone actually does it.
- inspecting the queue to see if, e.g. it is 'full' yet. Again, I
  think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-01 15:53:42 +08:00
|
|
|
} while (bio);
|
2010-02-23 15:55:42 +08:00
|
|
|
current->bio_list = NULL; /* deactivate */
|
2015-11-06 01:41:16 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
2007-05-01 15:53:42 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
EXPORT_SYMBOL(generic_make_request);
|
|
|
|
|
2017-11-03 02:29:50 +08:00
|
|
|
/**
|
|
|
|
* direct_make_request - hand a buffer directly to its device driver for I/O
|
|
|
|
* @bio: The bio describing the location in memory and on the device.
|
|
|
|
*
|
|
|
|
* This function behaves like generic_make_request(), but does not protect
|
|
|
|
* against recursion. Must only be used if the called driver is known
|
|
|
|
* to not call generic_make_request (or direct_make_request) again from
|
|
|
|
* its make_request function. (Calling direct_make_request again from
|
|
|
|
* a workqueue is perfectly fine as that doesn't recurse).
|
|
|
|
*/
|
|
|
|
blk_qc_t direct_make_request(struct bio *bio)
|
|
|
|
{
|
|
|
|
struct request_queue *q = bio->bi_disk->queue;
|
|
|
|
bool nowait = bio->bi_opf & REQ_NOWAIT;
|
|
|
|
blk_qc_t ret;
|
|
|
|
|
|
|
|
if (!generic_make_request_checks(bio))
|
|
|
|
return BLK_QC_T_NONE;
|
|
|
|
|
2017-11-10 02:49:58 +08:00
|
|
|
if (unlikely(blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0))) {
|
2017-11-03 02:29:50 +08:00
|
|
|
if (nowait && !blk_queue_dying(q))
|
2020-03-10 05:41:34 +08:00
|
|
|
bio_wouldblock_error(bio);
|
2017-11-03 02:29:50 +08:00
|
|
|
else
|
2020-03-10 05:41:34 +08:00
|
|
|
bio_io_error(bio);
|
2017-11-03 02:29:50 +08:00
|
|
|
return BLK_QC_T_NONE;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = q->make_request_fn(q, bio);
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(direct_make_request);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/**
|
2008-08-20 02:13:11 +08:00
|
|
|
* submit_bio - submit a bio to the block device layer for I/O
|
2005-04-17 06:20:36 +08:00
|
|
|
* @bio: The &struct bio which describes the I/O
|
|
|
|
*
|
|
|
|
* submit_bio() is very similar in purpose to generic_make_request(), and
|
|
|
|
* uses that function to do most of the work. Both are fairly rough
|
2008-08-20 02:13:11 +08:00
|
|
|
* interfaces; @bio must be presetup and ready for I/O.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
|
|
|
*/
|
2016-06-06 03:31:41 +08:00
|
|
|
blk_qc_t submit_bio(struct bio *bio)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2019-08-09 03:03:00 +08:00
|
|
|
bool workingset_read = false;
|
|
|
|
unsigned long pflags;
|
|
|
|
blk_qc_t ret;
|
|
|
|
|
2019-06-28 04:39:52 +08:00
|
|
|
if (blkcg_punt_bio_submit(bio))
|
|
|
|
return BLK_QC_T_NONE;
|
|
|
|
|
2007-09-27 19:01:25 +08:00
|
|
|
/*
|
|
|
|
* If it's a regular read/write or a barrier with data attached,
|
|
|
|
* go through the normal accounting stuff before submission.
|
|
|
|
*/
|
2012-09-19 00:19:25 +08:00
|
|
|
if (bio_has_data(bio)) {
|
2012-09-19 00:19:27 +08:00
|
|
|
unsigned int count;
|
|
|
|
|
2016-06-06 03:31:48 +08:00
|
|
|
if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
|
2018-02-27 20:10:03 +08:00
|
|
|
count = queue_logical_block_size(bio->bi_disk->queue) >> 9;
|
2012-09-19 00:19:27 +08:00
|
|
|
else
|
|
|
|
count = bio_sectors(bio);
|
|
|
|
|
2016-06-06 03:31:45 +08:00
|
|
|
if (op_is_write(bio_op(bio))) {
|
2007-09-27 19:01:25 +08:00
|
|
|
count_vm_events(PGPGOUT, count);
|
|
|
|
} else {
|
2019-08-09 03:03:00 +08:00
|
|
|
if (bio_flagged(bio, BIO_WORKINGSET))
|
|
|
|
workingset_read = true;
|
2013-10-12 06:44:27 +08:00
|
|
|
task_io_account_read(bio->bi_iter.bi_size);
|
2007-09-27 19:01:25 +08:00
|
|
|
count_vm_events(PGPGIN, count);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(block_dump)) {
|
|
|
|
char b[BDEVNAME_SIZE];
|
2010-09-14 14:48:01 +08:00
|
|
|
printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
|
2007-10-19 14:40:40 +08:00
|
|
|
current->comm, task_pid_nr(current),
|
2016-06-06 03:31:45 +08:00
|
|
|
op_is_write(bio_op(bio)) ? "WRITE" : "READ",
|
2013-10-12 06:44:27 +08:00
|
|
|
(unsigned long long)bio->bi_iter.bi_sector,
|
2017-08-24 01:10:32 +08:00
|
|
|
bio_devname(bio, b), count);
|
2007-09-27 19:01:25 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2019-08-09 03:03:00 +08:00
|
|
|
/*
|
|
|
|
* If we're reading data that is part of the userspace
|
|
|
|
* workingset, count submission time as memory stall. When the
|
|
|
|
* device is congested, or the submitting cgroup IO-throttled,
|
|
|
|
* submission can be a significant part of overall IO time.
|
|
|
|
*/
|
|
|
|
if (workingset_read)
|
|
|
|
psi_memstall_enter(&pflags);
|
|
|
|
|
|
|
|
ret = generic_make_request(bio);
|
|
|
|
|
|
|
|
if (workingset_read)
|
|
|
|
psi_memstall_leave(&pflags);
|
|
|
|
|
|
|
|
return ret;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(submit_bio);
|
|
|
|
|
2008-09-18 22:45:38 +08:00
|
|
|
/**
|
2015-11-26 15:46:57 +08:00
|
|
|
* blk_cloned_rq_check_limits - Helper function to check a cloned request
|
2020-03-10 05:41:33 +08:00
|
|
|
* for the new queue limits
|
2008-09-18 22:45:38 +08:00
|
|
|
* @q: the queue
|
|
|
|
* @rq: the request being checked
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* @rq may have been made based on weaker limitations of upper-level queues
|
|
|
|
* in request stacking drivers, and it may violate the limitation of @q.
|
|
|
|
* Since the block layer and the underlying device driver trust @rq
|
|
|
|
* after it is inserted to @q, it should be checked against @q before
|
|
|
|
* the insertion using this generic function.
|
|
|
|
*
|
|
|
|
* Request stacking drivers like request-based dm may change the queue
|
2015-11-26 15:46:57 +08:00
|
|
|
* limits when retrying requests on other queues. Those requests need
|
|
|
|
* to be checked against the new queue limits again during dispatch.
|
2008-09-18 22:45:38 +08:00
|
|
|
*/
|
2015-11-26 15:46:57 +08:00
|
|
|
static int blk_cloned_rq_check_limits(struct request_queue *q,
|
|
|
|
struct request *rq)
|
2008-09-18 22:45:38 +08:00
|
|
|
{
|
2016-06-06 03:32:15 +08:00
|
|
|
if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
|
2019-05-24 05:49:39 +08:00
|
|
|
printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
|
|
|
|
__func__, blk_rq_sectors(rq),
|
|
|
|
blk_queue_get_max_sectors(q, req_op(rq)));
|
2008-09-18 22:45:38 +08:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* queue's settings related to segment counting like q->bounce_pfn
|
|
|
|
* may differ from that of other stacking queues.
|
|
|
|
* Recalculate it to check the request correctly on this queue's
|
|
|
|
* limitation.
|
|
|
|
*/
|
2019-06-06 18:29:02 +08:00
|
|
|
rq->nr_phys_segments = blk_recalc_rq_segments(rq);
|
2010-02-26 13:20:39 +08:00
|
|
|
if (rq->nr_phys_segments > queue_max_segments(q)) {
|
2019-05-24 05:49:39 +08:00
|
|
|
printk(KERN_ERR "%s: over max segments limit. (%hu > %hu)\n",
|
|
|
|
__func__, rq->nr_phys_segments, queue_max_segments(q));
|
2008-09-18 22:45:38 +08:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_insert_cloned_request - Helper for stacking drivers to submit a request
|
|
|
|
* @q: the queue to submit the request
|
|
|
|
* @rq: the request being queued
|
|
|
|
*/
|
2017-06-03 15:38:04 +08:00
|
|
|
blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq)
|
2008-09-18 22:45:38 +08:00
|
|
|
{
|
2015-11-26 15:46:57 +08:00
|
|
|
if (blk_cloned_rq_check_limits(q, rq))
|
2017-06-03 15:38:04 +08:00
|
|
|
return BLK_STS_IOERR;
|
2008-09-18 22:45:38 +08:00
|
|
|
|
2011-07-27 07:09:03 +08:00
|
|
|
if (rq->rq_disk &&
|
|
|
|
should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
|
2017-06-03 15:38:04 +08:00
|
|
|
return BLK_STS_IOERR;
|
2008-09-18 22:45:38 +08:00
|
|
|
|
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-30 00:23:51 +08:00
|
|
|
if (blk_queue_io_stat(q))
|
|
|
|
blk_account_io_start(rq, true);
|
2008-09-18 22:45:38 +08:00
|
|
|
|
|
|
|
/*
|
2018-10-30 00:23:51 +08:00
|
|
|
* Since we have a scheduler attached on the top device,
|
|
|
|
* bypass a potential scheduler on the bottom device for
|
|
|
|
* insert.
|
2008-09-18 22:45:38 +08:00
|
|
|
*/
|
2019-04-05 01:08:43 +08:00
|
|
|
return blk_mq_request_issue_directly(rq, true);
|
2008-09-18 22:45:38 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
|
|
|
|
|
2009-07-03 16:48:17 +08:00
|
|
|
/**
|
|
|
|
* blk_rq_err_bytes - determine number of bytes till the next failure boundary
|
|
|
|
* @rq: request to examine
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* A request could be a merge of IOs which require different failure
|
|
|
|
* handling. This function determines the number of bytes which
|
|
|
|
* can be failed from the beginning of the request without
|
|
|
|
* crossing into an area which needs to be retried further.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* The number of bytes to fail.
|
|
|
|
*/
|
|
|
|
unsigned int blk_rq_err_bytes(const struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned int ff = rq->cmd_flags & REQ_FAILFAST_MASK;
|
|
|
|
unsigned int bytes = 0;
|
|
|
|
struct bio *bio;
|
|
|
|
|
2016-10-20 21:12:13 +08:00
|
|
|
if (!(rq->rq_flags & RQF_MIXED_MERGE))
|
2009-07-03 16:48:17 +08:00
|
|
|
return blk_rq_bytes(rq);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Currently the only 'mixing' which can happen is between
|
|
|
|
* different fastfail types. We can safely fail portions
|
|
|
|
* which have all the failfast bits that the first one has -
|
|
|
|
* the ones which are at least as eager to fail as the first
|
|
|
|
* one.
|
|
|
|
*/
|
|
|
|
for (bio = rq->bio; bio; bio = bio->bi_next) {
|
2016-08-06 05:35:16 +08:00
|
|
|
if ((bio->bi_opf & ff) != ff)
|
2009-07-03 16:48:17 +08:00
|
|
|
break;
|
2013-10-12 06:44:27 +08:00
|
|
|
bytes += bio->bi_iter.bi_size;
|
2009-07-03 16:48:17 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* this could lead to infinite loop */
|
|
|
|
BUG_ON(blk_rq_bytes(rq) && !bytes);
|
|
|
|
return bytes;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
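The failfast walk in blk_rq_err_bytes() can be tried out in user space with a sketch like the one below. It is only an illustration under stated assumptions: `toy_bio`, `toy_err_bytes` and the `TOY_FAILFAST_*` bit values are hypothetical and do not match the kernel's definitions, and the RQF_MIXED_MERGE shortcut and the BUG_ON check are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real values differ. */
#define TOY_FAILFAST_DEV	(1u << 0)
#define TOY_FAILFAST_TRANSPORT	(1u << 1)
#define TOY_FAILFAST_DRIVER	(1u << 2)
#define TOY_FAILFAST_MASK \
	(TOY_FAILFAST_DEV | TOY_FAILFAST_TRANSPORT | TOY_FAILFAST_DRIVER)

/* Hypothetical stand-in for struct bio. */
struct toy_bio {
	struct toy_bio *next;	/* plays the role of bi_next */
	unsigned int opf;	/* plays the role of bi_opf */
	unsigned int size;	/* plays the role of bi_iter.bi_size */
};

/*
 * Count the bytes at the front of the chain whose bios carry every
 * failfast bit that the request itself carries, i.e. the bios that are
 * at least as eager to fail as the request.
 */
static unsigned int toy_err_bytes(const struct toy_bio *head,
				  unsigned int cmd_flags)
{
	unsigned int ff = cmd_flags & TOY_FAILFAST_MASK;
	unsigned int bytes = 0;
	const struct toy_bio *bio;

	assert(head != NULL);
	for (bio = head; bio; bio = bio->next) {
		if ((bio->opf & ff) != ff)
			break;
		bytes += bio->size;
	}
	return bytes;
}
```

In this sketch the walk stops at the first bio that lacks one of the request's failfast bits, so only the leading run of equally eager bios is counted.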
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 16:20:05 +08:00
|
|
|
void blk_account_io_completion(struct request *req, unsigned int bytes)
|
2009-01-23 17:54:44 +08:00
|
|
|
{
|
2019-12-11 02:47:04 +08:00
|
|
|
if (req->part && blk_do_io_stat(req)) {
|
2018-07-18 19:47:39 +08:00
|
|
|
const int sgrp = op_stat_group(req_op(req));
|
2009-01-23 17:54:44 +08:00
|
|
|
struct hd_struct *part;
|
|
|
|
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_lock();
|
2011-01-05 23:57:38 +08:00
|
|
|
part = req->part;
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_add(part, sectors[sgrp], bytes >> 9);
|
2009-01-23 17:54:44 +08:00
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 17:08:53 +08:00
|
|
|
void blk_account_io_done(struct request *req, u64 now)
|
2009-01-23 17:54:44 +08:00
|
|
|
{
|
|
|
|
/*
|
2010-09-03 17:56:16 +08:00
|
|
|
* Account IO completion. flush_rq isn't accounted as a
|
|
|
|
* normal IO on queueing nor completion. Accounting the
|
|
|
|
* containing request is enough.
|
2009-01-23 17:54:44 +08:00
|
|
|
*/
|
2019-12-11 02:47:04 +08:00
|
|
|
if (req->part && blk_do_io_stat(req) &&
|
|
|
|
!(req->rq_flags & RQF_FLUSH_SEQ)) {
|
2018-07-18 19:47:39 +08:00
|
|
|
const int sgrp = op_stat_group(req_op(req));
|
2009-01-23 17:54:44 +08:00
|
|
|
struct hd_struct *part;
|
|
|
|
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_lock();
|
2011-01-05 23:57:38 +08:00
|
|
|
part = req->part;
|
2009-01-23 17:54:44 +08:00
|
|
|
|
block/diskstats: more accurate approximation of io_ticks for slow disks
Currently io_ticks is approximated by adding one at each start and end of
requests if jiffies counter has changed. This works perfectly for requests
shorter than a jiffy or if one of requests starts/ends at each jiffy.
If disk executes just one request at a time and they are longer than two
jiffies then only first and last jiffies will be accounted.
Fix is simple: at the end of request add up into io_ticks jiffies passed
since last update rather than just one jiffy.
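The difference between the two accounting schemes can be sketched in user space as below. This is a simplified model, not the kernel implementation: `toy_part`, `toy_update_io_ticks` and `toy_run` are hypothetical names, the `precise` switch exists only to compare old and new behaviour side by side, and the sketch ignores concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-partition counters. */
struct toy_part {
	unsigned long stamp;	/* jiffy of the last update */
	unsigned long io_ticks;	/* accumulated busy jiffies */
};

/*
 * Old scheme (precise == false): add one tick whenever the jiffy changed.
 * New scheme (precise == true): at the end of a request, add every jiffy
 * that passed since the last update instead of just one.
 */
static void toy_update_io_ticks(struct toy_part *p, unsigned long now,
				bool end, bool precise)
{
	if (p->stamp == now)
		return;
	p->io_ticks += (precise && end) ? now - p->stamp : 1;
	p->stamp = now;
}

/* Run nr back-to-back requests of len jiffies each, starting at first. */
static unsigned long toy_run(unsigned long first, unsigned long len,
			     int nr, bool precise)
{
	struct toy_part p = { 0, 0 };
	int i;

	assert(first > 0 && len > 0);
	for (i = 0; i < nr; i++) {
		unsigned long start = first + len * (unsigned long)i;

		toy_update_io_ticks(&p, start, false, precise);
		toy_update_io_ticks(&p, start + len, true, precise);
	}
	return p.io_ticks;
}
```

With ten 12-jiffy requests issued one after another, the old scheme in this model accounts 11 ticks over a 120-jiffy busy period, while the new one accounts 121, which mirrors the jump in "%util" reported below.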
Example: common HDD executes random read 4k requests around 12ms.
fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
iostat -x 10 sdb
Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
Before:
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdb          0,00     0,00    82,60     0,00   330,40     0,00     8,00     0,96    12,09    12,09     0,00     1,02     8,43
After:
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdb          0,00     0,00    82,50     0,00   330,00     0,00     8,00     1,00    12,10    12,10     0,00    12,12    99,99
Now io_ticks does not lose time between start and end of requests, but
for queue-depth > 1 some I/O time between adjacent starts might be lost.
For load estimation "%util" is not as useful as average queue length,
but it clearly shows how often disk queue is completely empty.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 21:07:04 +08:00
|
|
|
update_io_ticks(part, jiffies, true);
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_inc(part, ios[sgrp]);
|
|
|
|
part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
|
2018-07-18 19:47:39 +08:00
|
|
|
part_dec_in_flight(req->q, part, rq_data_dir(req));
|
2009-01-23 17:54:44 +08:00
|
|
|
|
2011-01-07 15:43:37 +08:00
|
|
|
hd_struct_put(part);
|
2009-01-23 17:54:44 +08:00
|
|
|
part_stat_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
void blk_account_io_start(struct request *rq, bool new_io)
|
|
|
|
{
|
|
|
|
struct hd_struct *part;
|
|
|
|
int rw = rq_data_dir(rq);
|
|
|
|
|
|
|
|
if (!blk_do_io_stat(rq))
|
|
|
|
return;
|
|
|
|
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_lock();
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
if (!new_io) {
|
|
|
|
part = rq->part;
|
2018-12-07 00:41:18 +08:00
|
|
|
part_stat_inc(part, merges[rw]);
|
2013-10-24 16:20:05 +08:00
|
|
|
} else {
|
|
|
|
part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
|
|
|
|
if (!hd_struct_try_get(part)) {
|
|
|
|
/*
|
|
|
|
* The partition is already being removed,
|
|
|
|
* the request will be accounted on the disk only
|
|
|
|
*
|
|
|
|
* We take a reference on disk->part0 although that
|
|
|
|
* partition will never be deleted, so we can treat
|
|
|
|
* it as any other partition.
|
|
|
|
*/
|
|
|
|
part = &rq->rq_disk->part0;
|
|
|
|
hd_struct_get(part);
|
|
|
|
}
|
2017-07-01 11:55:08 +08:00
|
|
|
part_inc_in_flight(rq->q, part, rw);
|
2013-10-24 16:20:05 +08:00
|
|
|
rq->part = part;
|
|
|
|
}
|
|
|
|
|
block/diskstats: more accurate approximation of io_ticks for slow disks
Currently io_ticks is approximated by adding one at each start and end of
requests if jiffies counter has changed. This works perfectly for requests
shorter than a jiffy or if one of requests starts/ends at each jiffy.
If disk executes just one request at a time and they are longer than two
jiffies then only first and last jiffies will be accounted.
Fix is simple: at the end of request add up into io_ticks jiffies passed
since last update rather than just one jiffy.
Example: common HDD executes random read 4k requests around 12ms.
fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
iostat -x 10 sdb
Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
Before:
Device:  rrqm/s  wrqm/s    r/s   w/s   rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0,00    0,00  82,60  0,00  330,40   0,00      8,00      0,96  12,09    12,09     0,00   1,02   8,43
After:
Device:  rrqm/s  wrqm/s    r/s   w/s   rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0,00    0,00  82,50  0,00  330,00   0,00      8,00      1,00  12,10    12,10     0,00  12,12  99,99
Now io_ticks does not lose time between start and end of requests, but
for queue-depth > 1 some I/O time between adjacent starts might be lost.
For load estimation "%util" is not as useful as average queue length,
but it clearly shows how often disk queue is completely empty.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 21:07:04 +08:00
|
|
|
update_io_ticks(part, jiffies, false);
|
2018-12-07 00:41:19 +08:00
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
part_stat_unlock();
|
|
|
|
}
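The io_ticks change described in the diskstats commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct tick_stats`, `ticks_update_old()`, `ticks_update_new()` and `ticks_for_request()` are hypothetical names invented for this example, jiffies are plain integers, and the model ignores per-CPU counters and concurrent updates entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-partition counters. */
struct tick_stats {
	unsigned long stamp;    /* jiffy of the last update */
	unsigned long io_ticks; /* approximated busy time, in jiffies */
};

/* Older scheme: add one tick whenever the jiffy counter has moved. */
static void ticks_update_old(struct tick_stats *s, unsigned long now)
{
	if (s->stamp != now) {
		s->stamp = now;
		s->io_ticks += 1;
	}
}

/* Newer scheme: at request end, add every jiffy since the last update. */
static void ticks_update_new(struct tick_stats *s, unsigned long now, bool end)
{
	if (s->stamp != now) {
		s->io_ticks += end ? now - s->stamp : 1;
		s->stamp = now;
	}
}

/* Account a single request (queue depth 1) running from start to end. */
static unsigned long ticks_for_request(unsigned long start, unsigned long end,
				       bool use_new)
{
	struct tick_stats s = { .stamp = 0, .io_ticks = 0 };

	if (use_new) {
		ticks_update_new(&s, start, false);
		ticks_update_new(&s, end, true);
	} else {
		ticks_update_old(&s, start);
		ticks_update_old(&s, end);
	}
	return s.io_ticks;
}
```

In this model a request that spans 12 jiffies is credited with only 2 ticks by the older scheme and with 13 by the newer one, which matches the direction of the "%util" change quoted in the commit message.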
|
|
|
|
|
2017-11-03 02:29:51 +08:00
|
|
|
/*
|
|
|
|
* Steal bios from a request and add them to a bio list.
|
|
|
|
* The request must not have been partially completed before.
|
|
|
|
*/
|
|
|
|
void blk_steal_bios(struct bio_list *list, struct request *rq)
|
|
|
|
{
|
|
|
|
if (rq->bio) {
|
|
|
|
if (list->tail)
|
|
|
|
list->tail->bi_next = rq->bio;
|
|
|
|
else
|
|
|
|
list->head = rq->bio;
|
|
|
|
list->tail = rq->biotail;
|
|
|
|
|
|
|
|
rq->bio = NULL;
|
|
|
|
rq->biotail = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq->__data_len = 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_steal_bios);
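The pointer handling in blk_steal_bios() is a constant-time splice of one singly linked chain onto the tail of another. A minimal userspace sketch of the same idea follows; `struct node`, `struct node_list`, `struct owner`, `steal_nodes()` and `list_length()` are hypothetical stand-ins for this example and are not kernel types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: only the link matters here. */
struct node {
	struct node *next;
	int id;
};

/* Hypothetical stand-in for struct bio_list. */
struct node_list {
	struct node *head;
	struct node *tail;
};

/* Hypothetical stand-in for the request that owns a chain. */
struct owner {
	struct node *first;
	struct node *last;
	unsigned int data_len;
};

/* Move the owner's whole chain to the end of list in O(1). */
static void steal_nodes(struct node_list *list, struct owner *o)
{
	if (o->first) {
		if (list->tail)
			list->tail->next = o->first; /* append to existing chain */
		else
			list->head = o->first;       /* list was empty */
		list->tail = o->last;

		o->first = NULL;
		o->last = NULL;
	}
	o->data_len = 0; /* the owner no longer carries any data */
}

/* Count the nodes reachable from the list head. */
static int list_length(const struct node_list *list)
{
	int n = 0;
	const struct node *p;

	for (p = list->head; p; p = p->next)
		n++;
	return n;
}
```

No node is visited during the splice itself, so the cost does not depend on the length of either chain.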
|
|
|
|
|
2007-12-12 06:52:28 +08:00
|
|
|
/**
|
2009-04-23 10:05:18 +08:00
|
|
|
* blk_update_request - Special helper function for request stacking drivers
|
2009-06-12 11:00:41 +08:00
|
|
|
* @req: the request being processed
|
2017-06-03 15:38:04 +08:00
|
|
|
* @error: block status code
|
2009-06-12 11:00:41 +08:00
|
|
|
* @nr_bytes: number of bytes to complete @req
|
2007-12-12 06:52:28 +08:00
|
|
|
*
|
|
|
|
* Description:
|
2009-06-12 11:00:41 +08:00
|
|
|
* Ends I/O on a number of bytes attached to @req, but doesn't complete
|
|
|
|
* the request structure even if @req doesn't have leftover.
|
|
|
|
* If @req has leftover, sets it up for the next range of segments.
|
2009-04-23 10:05:18 +08:00
|
|
|
*
|
|
|
|
* This special helper function is only for request stacking drivers
|
|
|
|
* (e.g. request-based dm) so that they can handle partial completion.
|
2019-05-23 23:43:11 +08:00
|
|
|
* Actual device drivers should use blk_mq_end_request instead.
|
2009-04-23 10:05:18 +08:00
|
|
|
*
|
|
|
|
* Passing the result of blk_rq_bytes() as @nr_bytes guarantees
|
|
|
|
* %false return from this function.
|
2007-12-12 06:52:28 +08:00
|
|
|
*
|
2018-06-28 04:09:05 +08:00
|
|
|
* Note:
|
|
|
|
* The RQF_SPECIAL_PAYLOAD flag is ignored on purpose in both
|
|
|
|
* blk_rq_bytes() and in blk_update_request().
|
|
|
|
*
|
2007-12-12 06:52:28 +08:00
|
|
|
* Return:
|
2009-04-23 10:05:18 +08:00
|
|
|
* %false - this request doesn't have any more data
|
|
|
|
* %true - this request has more data
|
2007-12-12 06:52:28 +08:00
|
|
|
**/
|
2017-06-03 15:38:04 +08:00
|
|
|
bool blk_update_request(struct request *req, blk_status_t error,
|
|
|
|
unsigned int nr_bytes)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2012-09-21 07:38:30 +08:00
|
|
|
int total_bytes;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-06-03 15:38:04 +08:00
|
|
|
trace_block_rq_complete(req, blk_status_to_errno(error), nr_bytes);
|
2014-10-01 20:32:31 +08:00
|
|
|
|
2009-04-23 10:05:18 +08:00
|
|
|
if (!req->bio)
|
|
|
|
return false;
|
|
|
|
|
2019-09-16 23:44:29 +08:00
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ &&
|
|
|
|
error == BLK_STS_OK)
|
|
|
|
req->q->integrity.profile->complete_fn(req, nr_bytes);
|
|
|
|
#endif
|
|
|
|
|
2017-06-03 15:38:04 +08:00
|
|
|
if (unlikely(error && !blk_rq_is_passthrough(req) &&
|
|
|
|
!(req->rq_flags & RQF_QUIET)))
|
2019-06-21 01:59:15 +08:00
|
|
|
print_req_error(req, error, __func__);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-01-23 17:54:44 +08:00
|
|
|
blk_account_io_completion(req, nr_bytes);
|
2005-11-01 15:35:42 +08:00
|
|
|
|
2012-09-21 07:38:30 +08:00
|
|
|
total_bytes = 0;
|
|
|
|
while (req->bio) {
|
|
|
|
struct bio *bio = req->bio;
|
2013-10-12 06:44:27 +08:00
|
|
|
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2018-06-20 01:26:40 +08:00
|
|
|
if (bio_bytes == bio->bi_iter.bi_size)
|
2005-04-17 06:20:36 +08:00
|
|
|
req->bio = bio->bi_next;
|
|
|
|
|
block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 23:40:52 +08:00
|
|
|
/* Completion has already been traced */
|
|
|
|
bio_clear_flag(bio, BIO_TRACE_COMPLETION);
|
2012-09-21 07:38:30 +08:00
|
|
|
req_bio_endio(req, bio, bio_bytes, error);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-09-21 07:38:30 +08:00
|
|
|
total_bytes += bio_bytes;
|
|
|
|
nr_bytes -= bio_bytes;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-09-21 07:38:30 +08:00
|
|
|
if (!nr_bytes)
|
|
|
|
break;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* completely done
|
|
|
|
*/
|
2009-04-23 10:05:18 +08:00
|
|
|
if (!req->bio) {
|
|
|
|
/*
|
|
|
|
* Reset counters so that the request stacking driver
|
|
|
|
* can find how many bytes remain in the request
|
|
|
|
* later.
|
|
|
|
*/
|
2009-05-07 21:24:44 +08:00
|
|
|
req->__data_len = 0;
|
2009-04-23 10:05:18 +08:00
|
|
|
return false;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-05-07 21:24:44 +08:00
|
|
|
req->__data_len -= total_bytes;
|
2009-05-07 21:24:41 +08:00
|
|
|
|
|
|
|
/* update sector only for requests with clear definition of sector */
|
2017-01-31 23:57:29 +08:00
|
|
|
if (!blk_rq_is_passthrough(req))
|
2009-05-07 21:24:44 +08:00
|
|
|
req->__sector += total_bytes >> 9;
|
2009-05-07 21:24:41 +08:00
|
|
|
|
2009-07-03 16:48:17 +08:00
|
|
|
/* mixed attributes always follow the first bio */
|
2016-10-20 21:12:13 +08:00
|
|
|
if (req->rq_flags & RQF_MIXED_MERGE) {
|
2009-07-03 16:48:17 +08:00
|
|
|
req->cmd_flags &= ~REQ_FAILFAST_MASK;
|
2016-08-06 05:35:16 +08:00
|
|
|
req->cmd_flags |= req->bio->bi_opf & REQ_FAILFAST_MASK;
|
2009-07-03 16:48:17 +08:00
|
|
|
}
|
|
|
|
|
2017-05-11 18:34:38 +08:00
|
|
|
if (!(req->rq_flags & RQF_SPECIAL_PAYLOAD)) {
|
|
|
|
/*
|
|
|
|
* If total number of sectors is less than the first segment
|
|
|
|
* size, something has gone terribly wrong.
|
|
|
|
*/
|
|
|
|
if (blk_rq_bytes(req) < blk_rq_cur_bytes(req)) {
|
|
|
|
blk_dump_rq_flags(req, "request botched");
|
|
|
|
req->__data_len = blk_rq_cur_bytes(req);
|
|
|
|
}
|
2009-05-07 21:24:41 +08:00
|
|
|
|
2017-05-11 18:34:38 +08:00
|
|
|
/* recalculate the number of segments */
|
2019-06-06 18:29:02 +08:00
|
|
|
req->nr_phys_segments = blk_recalc_rq_segments(req);
|
2017-05-11 18:34:38 +08:00
|
|
|
}
|
2009-05-07 21:24:41 +08:00
|
|
|
|
2009-04-23 10:05:18 +08:00
|
|
|
return true;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2009-04-23 10:05:18 +08:00
|
|
|
EXPORT_SYMBOL_GPL(blk_update_request);
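The partial-completion loop in blk_update_request() can be followed more easily on a reduced model. The sketch below is an assumption-laden userspace illustration, not the kernel code: `struct seg` stands in for a bio, `struct xfer` for a request, and `xfer_update()` for the function above. Shrinking `size` in place is a simplification of what completing part of a bio does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: a byte count and a link. */
struct seg {
	unsigned int size;
	struct seg *next;
};

/* Hypothetical stand-in for a request. */
struct xfer {
	struct seg *head;
	unsigned int data_len;
};

/*
 * Complete nr_bytes of the transfer.
 * Returns false when nothing is left, true when data remains.
 */
static bool xfer_update(struct xfer *x, unsigned int nr_bytes)
{
	unsigned int total = 0;

	if (!x->head)
		return false;

	while (x->head) {
		struct seg *s = x->head;
		unsigned int n = s->size < nr_bytes ? s->size : nr_bytes;

		if (n == s->size)
			x->head = s->next; /* fully consumed: unlink it */

		s->size -= n;              /* model of advancing the segment */
		total += n;
		nr_bytes -= n;

		if (!nr_bytes)
			break;
	}

	if (!x->head) {
		/* completely done: reset the counter for the caller */
		x->data_len = 0;
		return false;
	}

	x->data_len -= total;
	return true;
}
```

With three 4096-byte segments, completing 6144 bytes leaves the second segment at the head with 2048 bytes outstanding, and a second call with the remaining 6144 bytes finishes the transfer.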
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-11-26 16:16:19 +08:00
|
|
|
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
|
|
|
|
/**
|
|
|
|
* rq_flush_dcache_pages - Helper function to flush all pages in a request
|
|
|
|
* @rq: the request to be flushed
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Flush all pages in @rq.
|
|
|
|
*/
|
|
|
|
void rq_flush_dcache_pages(struct request *rq)
|
|
|
|
{
|
|
|
|
struct req_iterator iter;
|
2013-11-24 09:19:00 +08:00
|
|
|
struct bio_vec bvec;
|
2009-11-26 16:16:19 +08:00
|
|
|
|
|
|
|
rq_for_each_segment(bvec, rq, iter)
|
2013-11-24 09:19:00 +08:00
|
|
|
flush_dcache_page(bvec.bv_page);
|
2009-11-26 16:16:19 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
|
|
|
|
#endif
|
|
|
|
|
2008-10-01 22:12:15 +08:00
|
|
|
/**
|
|
|
|
* blk_lld_busy - Check if underlying low-level drivers of a device are busy
|
|
|
|
* @q : the queue of the device being checked
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Check if underlying low-level drivers of a device are busy.
|
|
|
|
* If the drivers want to export their busy state, they must set their own
|
|
|
|
* exporting function using blk_queue_lld_busy() first.
|
|
|
|
*
|
|
|
|
* Basically, this function is used only by request stacking drivers
|
|
|
|
* to stop dispatching requests to underlying devices when underlying
|
|
|
|
* devices are busy. This behavior helps more I/O merging on the queue
|
|
|
|
* of the request stacking driver and prevents I/O throughput regression
|
|
|
|
* on burst I/O load.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* 0 - Not busy (The request stacking driver should dispatch request)
|
|
|
|
* 1 - Busy (The request stacking driver should stop dispatching request)
|
|
|
|
*/
|
|
|
|
int blk_lld_busy(struct request_queue *q)
|
|
|
|
{
|
2018-11-16 03:22:51 +08:00
|
|
|
if (queue_is_mq(q) && q->mq_ops->busy)
|
2018-10-30 00:15:10 +08:00
|
|
|
return q->mq_ops->busy(q);
|
2008-10-01 22:12:15 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_lld_busy);
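blk_lld_busy() follows a common pattern: call an optional driver hook if one is registered, otherwise report "not busy". A minimal userspace sketch of that pattern is shown here; `struct queue_ops`, `struct queue`, `queue_busy()` and `always_busy()` are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operations table with an optional hook. */
struct queue_ops {
	bool (*busy)(void *ctx); /* may be NULL when the driver has no opinion */
};

/* Hypothetical queue that may or may not have an ops table. */
struct queue {
	const struct queue_ops *ops;
	void *ctx;
};

/* Returns 1 when the driver reports busy, 0 otherwise (including no hook). */
static int queue_busy(const struct queue *q)
{
	if (q->ops && q->ops->busy)
		return q->ops->busy(q->ctx);

	return 0;
}

/* Example hook that always reports a busy device. */
static bool always_busy(void *ctx)
{
	(void)ctx;
	return true;
}
```

Defaulting to 0 means a stacking driver keeps dispatching to any lower device that does not export a busy state.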
|
|
|
|
|
2015-06-26 22:01:13 +08:00
|
|
|
/**
|
|
|
|
* blk_rq_unprep_clone - Helper function to free all bios in a cloned request
|
|
|
|
* @rq: the clone request to be cleaned up
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Free all bios in @rq for a cloned request.
|
|
|
|
*/
|
|
|
|
void blk_rq_unprep_clone(struct request *rq)
|
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
|
|
|
|
while ((bio = rq->bio) != NULL) {
|
|
|
|
rq->bio = bio->bi_next;
|
|
|
|
|
|
|
|
bio_put(bio);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_rq_prep_clone - Helper function to setup clone request
|
|
|
|
* @rq: the request to be setup
|
|
|
|
* @rq_src: original request to be cloned
|
|
|
|
* @bs: bio_set that bios for clone are allocated from
|
|
|
|
* @gfp_mask: memory allocation mask for bio
|
|
|
|
* @bio_ctr: setup function to be called for each clone bio.
|
|
|
|
* Returns %0 for success, non %0 for failure.
|
|
|
|
* @data: private data to be passed to @bio_ctr
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
|
|
|
|
* Also, pages which the original bios are pointing to are not copied
|
|
|
|
* and the cloned bios just point to the same pages.
|
|
|
|
* So cloned bios must be completed before original bios, which means
|
|
|
|
* the caller must complete @rq before @rq_src.
|
|
|
|
*/
|
|
|
|
int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
|
|
|
|
struct bio_set *bs, gfp_t gfp_mask,
|
|
|
|
int (*bio_ctr)(struct bio *, struct bio *, void *),
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct bio *bio, *bio_src;
|
|
|
|
|
|
|
|
if (!bs)
|
2018-05-09 09:33:52 +08:00
|
|
|
bs = &fs_bio_set;
|
2015-06-26 22:01:13 +08:00
|
|
|
|
|
|
|
__rq_for_each_bio(bio_src, rq_src) {
|
|
|
|
bio = bio_clone_fast(bio_src, gfp_mask, bs);
|
|
|
|
if (!bio)
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
if (bio_ctr && bio_ctr(bio, bio_src, data))
|
|
|
|
goto free_and_out;
|
|
|
|
|
|
|
|
if (rq->bio) {
|
|
|
|
rq->biotail->bi_next = bio;
|
|
|
|
rq->biotail = bio;
|
|
|
|
} else
|
|
|
|
rq->bio = rq->biotail = bio;
|
|
|
|
}
|
|
|
|
|
2020-03-10 05:41:36 +08:00
|
|
|
/* Copy attributes of the original request to the clone request. */
|
|
|
|
rq->__sector = blk_rq_pos(rq_src);
|
|
|
|
rq->__data_len = blk_rq_bytes(rq_src);
|
|
|
|
if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
|
|
|
|
rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
|
|
|
|
rq->special_vec = rq_src->special_vec;
|
|
|
|
}
|
|
|
|
rq->nr_phys_segments = rq_src->nr_phys_segments;
|
|
|
|
rq->ioprio = rq_src->ioprio;
|
|
|
|
rq->extra_len = rq_src->extra_len;
|
2015-06-26 22:01:13 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
free_and_out:
|
|
|
|
if (bio)
|
|
|
|
bio_put(bio);
|
|
|
|
blk_rq_unprep_clone(rq);
|
|
|
|
|
|
|
|
return -ENOMEM;
|
block: add request clone interface (v2)
This patch adds the following 2 interfaces for request-stacking drivers:
- blk_rq_prep_clone(struct request *clone, struct request *orig,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio*, void *),
void *data)
* Clones bios in the original request to the clone request
(bio_ctr is called for each cloned bio.)
* Copies attributes of the original request to the clone request.
The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not
copied.
- blk_rq_unprep_clone(struct request *clone)
* Frees cloned bios from the clone request.
Request stacking drivers (e.g. request-based dm) need to make a clone
request for a submitted request and dispatch it to other devices.
To allocate request for the clone, request stacking drivers may not
be able to use blk_get_request() because the allocation may be done
in an irq-disabled context.
So blk_rq_prep_clone() takes a request allocated by the caller
as an argument.
For each clone bio in the clone request, request stacking drivers
should be able to set up their own completion handler.
So blk_rq_prep_clone() takes a callback function which is called
for each clone bio, and a pointer for private data which is passed
to the callback.
NOTE:
blk_rq_prep_clone() doesn't copy any actual data of the original
request. Pages are shared between original bios and cloned bios.
So caller must not complete the original request before the clone
request.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-11 19:10:16 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
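blk_rq_prep_clone() and blk_rq_unprep_clone() together form a "build a chain, roll everything back on failure" pair. The userspace sketch below mirrors that control flow under simplifying assumptions: `struct item`, `struct chain`, `chain_prep_clone()`, `chain_unprep()` and `fail_on_value()` are hypothetical, `malloc()` replaces the bio_set allocation, and each item's payload is copied rather than shared.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a bio and for the bio chain of a request. */
struct item {
	int value;
	struct item *next;
};

struct chain {
	struct item *head;
	struct item *tail;
};

/* Per-clone setup callback; returns 0 on success, non-zero on failure. */
typedef int (*item_ctr_fn)(struct item *clone, const struct item *src,
			   void *data);

/* Free every item of a cloned chain. */
static void chain_unprep(struct chain *c)
{
	struct item *it;

	while ((it = c->head) != NULL) {
		c->head = it->next;
		free(it);
	}
	c->tail = NULL;
}

/* Clone src into dst; on any failure dst is left empty. */
static int chain_prep_clone(struct chain *dst, const struct chain *src,
			    item_ctr_fn ctr, void *data)
{
	const struct item *s;
	struct item *it = NULL;

	for (s = src->head; s; s = s->next) {
		it = malloc(sizeof(*it));
		if (!it)
			goto free_and_out;
		it->value = s->value;
		it->next = NULL;

		if (ctr && ctr(it, s, data))
			goto free_and_out;

		if (dst->head) {
			dst->tail->next = it;
			dst->tail = it;
		} else {
			dst->head = dst->tail = it;
		}
	}
	return 0;

free_and_out:
	free(it);          /* not linked into dst yet; free(NULL) is a no-op */
	chain_unprep(dst); /* drop everything cloned so far */
	return -ENOMEM;
}

/* Example callback: fail when the source value equals *(int *)data. */
static int fail_on_value(struct item *clone, const struct item *src, void *data)
{
	(void)clone;
	return src->value == *(const int *)data;
}
```

The item that failed setup is freed separately because it has not been linked into the destination chain at that point, which is the same reason the function above calls bio_put() before blk_rq_unprep_clone().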
|
|
|
|
|
2014-04-08 23:15:35 +08:00
|
|
|
int kblockd_schedule_work(struct work_struct *work)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
|
|
|
return queue_work(kblockd_workqueue, work);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_schedule_work);
|
|
|
|
|
2017-04-10 23:54:55 +08:00
|
|
|
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
|
|
|
|
unsigned long delay)
|
|
|
|
{
|
|
|
|
return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
|
|
|
|
|
2011-09-21 16:00:16 +08:00
|
|
|
/**
|
|
|
|
* blk_start_plug - initialize blk_plug and track it inside the task_struct
|
|
|
|
* @plug: The &struct blk_plug that needs to be initialized
|
|
|
|
*
|
|
|
|
* Description:
|
2019-01-09 05:57:34 +08:00
|
|
|
* blk_start_plug() indicates to the block layer an intent by the caller
|
|
|
|
* to submit multiple I/O requests in a batch. The block layer may use
|
|
|
|
* this hint to defer submitting I/Os from the caller until blk_finish_plug()
|
|
|
|
* is called. However, the block layer may choose to submit requests
|
|
|
|
* before a call to blk_finish_plug() if the number of queued I/Os
|
|
|
|
* exceeds %BLK_MAX_REQUEST_COUNT, or if the size of the I/O is larger than
|
|
|
|
* %BLK_PLUG_FLUSH_SIZE. The queued I/Os may also be submitted early if
|
|
|
|
* the task schedules (see below).
|
|
|
|
*
|
2011-09-21 16:00:16 +08:00
|
|
|
* Tracking blk_plug inside the task_struct will help with auto-flushing the
|
|
|
|
* pending I/O should the task end up blocking between blk_start_plug() and
|
|
|
|
* blk_finish_plug(). This is important from a performance perspective, but
|
|
|
|
* also ensures that we don't deadlock. For instance, if the task is blocking
|
|
|
|
* for a memory allocation, memory reclaim could end up wanting to free a
|
|
|
|
* page belonging to that request that is currently residing in our private
|
|
|
|
* plug. By flushing the pending I/O when the process goes to sleep, we avoid
|
|
|
|
* this kind of deadlock.
|
|
|
|
*/
|
2011-03-08 20:19:51 +08:00
|
|
|
void blk_start_plug(struct blk_plug *plug)
|
|
|
|
{
|
|
|
|
struct task_struct *tsk = current;
|
|
|
|
|
2015-05-09 01:51:28 +08:00
|
|
|
/*
|
|
|
|
* If this is a nested plug, don't actually assign it.
|
|
|
|
*/
|
|
|
|
if (tsk->plug)
|
|
|
|
return;
|
|
|
|
|
2013-10-24 16:20:05 +08:00
|
|
|
INIT_LIST_HEAD(&plug->mq_list);
|
2011-04-18 15:52:22 +08:00
|
|
|
INIT_LIST_HEAD(&plug->cb_list);
|
2018-11-24 13:04:33 +08:00
|
|
|
plug->rq_count = 0;
|
2018-11-28 08:13:56 +08:00
|
|
|
plug->multiple_queues = false;
|
2018-11-24 13:04:33 +08:00
|
|
|
|
2011-03-08 20:19:51 +08:00
|
|
|
/*
|
2015-05-09 01:51:28 +08:00
|
|
|
* Store ordering should not be needed here, since a potential
|
|
|
|
* preempt will imply a full memory barrier
|
2011-03-08 20:19:51 +08:00
|
|
|
*/
|
2015-05-09 01:51:28 +08:00
|
|
|
tsk->plug = plug;
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_start_plug);
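The nested-plug rule in blk_start_plug() (only the outermost plug is recorded in the task) can be modelled in a few lines. This is a sketch with hypothetical names: `struct plug`, `struct task`, `plug_start()`, `plug_queue()` and `plug_finish()` are invented for the example, and the finish side is an assumption about how the matching call behaves, since it is not shown in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plug: just counts deferred submissions. */
struct plug {
	int queued;
};

/* Hypothetical task: tracks the active plug and what reached the device. */
struct task {
	struct plug *plug;
	int submitted;
};

/* Install p as the task's plug unless one is already active. */
static void plug_start(struct task *t, struct plug *p)
{
	if (t->plug)
		return; /* nested plug: keep using the outer one */

	p->queued = 0;
	t->plug = p;
}

/* Submit one I/O: deferred while plugged, immediate otherwise. */
static void plug_queue(struct task *t)
{
	if (t->plug)
		t->plug->queued++;
	else
		t->submitted++;
}

/* Assumed counterpart: only the plug that was installed may flush. */
static void plug_finish(struct task *t, struct plug *p)
{
	if (p != t->plug)
		return; /* an inner plug never owned the batch */

	t->submitted += p->queued;
	p->queued = 0;
	t->plug = NULL;
}
```

Finishing the inner plug is a no-op in this model, so the whole batch is released once, when the outer plug finishes.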
|
|
|
|
|
2012-07-31 15:08:15 +08:00
|
|
|
static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
|
2011-04-18 15:52:22 +08:00
|
|
|
{
|
|
|
|
LIST_HEAD(callbacks);
|
|
|
|
|
2012-07-31 15:08:15 +08:00
|
|
|
while (!list_empty(&plug->cb_list)) {
|
|
|
|
list_splice_init(&plug->cb_list, &callbacks);
|
2011-04-18 15:52:22 +08:00
|
|
|
|
2012-07-31 15:08:15 +08:00
|
|
|
while (!list_empty(&callbacks)) {
|
|
|
|
struct blk_plug_cb *cb = list_first_entry(&callbacks,
|
2011-04-18 15:52:22 +08:00
|
|
|
struct blk_plug_cb,
|
|
|
|
list);
|
2012-07-31 15:08:15 +08:00
|
|
|
list_del(&cb->list);
|
2012-07-31 15:08:15 +08:00
|
|
|
cb->callback(cb, from_schedule);
|
2012-07-31 15:08:15 +08:00
|
|
|
}
|
2011-04-18 15:52:22 +08:00
|
|
|
}
|
|
|
|
}
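The nested loops in flush_plug_callbacks() above detach the whole pending list before running anything, and the outer loop re-checks the list afterwards, so a callback registered while a batch is running is still picked up. A minimal userspace sketch of that pattern follows, assuming a hand-rolled singly linked list in place of the kernel's list_head helpers. The names struct cb, struct plug, plug_add, flush_callbacks and requeue_once are illustrative, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct plug;

/* Hypothetical stand-in for struct blk_plug_cb. */
struct cb {
	struct cb *next;
	void (*fn)(struct cb *cb, struct plug *plug);
	int runs;
};

/* Hypothetical stand-in for struct blk_plug: only the callback list. */
struct plug {
	struct cb *pending;
};

static void plug_add(struct plug *plug, struct cb *cb)
{
	cb->next = plug->pending;
	plug->pending = cb;
}

/* Drain plug->pending, tolerating callbacks that queue more work. */
static int flush_callbacks(struct plug *plug)
{
	int total = 0;

	while (plug->pending) {
		/* Detach the whole batch first, like list_splice_init(). */
		struct cb *batch = plug->pending;

		plug->pending = NULL;

		while (batch) {
			struct cb *cb = batch;

			batch = cb->next;	/* unlink before calling */
			cb->next = NULL;
			cb->runs++;
			total++;
			cb->fn(cb, plug);
		}
	}
	return total;
}

/* Example callback: re-registers itself exactly once. */
static void requeue_once(struct cb *cb, struct plug *plug)
{
	if (cb->runs == 1)
		plug_add(plug, cb);
}

/* Example callback: does nothing further. */
static void run_once(struct cb *cb, struct plug *plug)
{
	(void)cb;
	(void)plug;
}

/* Two callbacks, one of which requeues itself: three invocations total. */
int demo_flush(void)
{
	struct plug p = { NULL };
	struct cb a = { NULL, requeue_once, 0 };
	struct cb b = { NULL, run_once, 0 };

	plug_add(&p, &a);
	plug_add(&p, &b);
	return flush_callbacks(&p);
}
```

With only the inner loop, the requeued entry would be left on the list after the flush returned; the outer loop is what makes the drain complete.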
|
|
|
|
|
2012-07-31 15:08:14 +08:00
|
|
|
struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data,
|
|
|
|
int size)
|
|
|
|
{
|
|
|
|
struct blk_plug *plug = current->plug;
|
|
|
|
struct blk_plug_cb *cb;
|
|
|
|
|
|
|
|
if (!plug)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
list_for_each_entry(cb, &plug->cb_list, list)
|
|
|
|
if (cb->callback == unplug && cb->data == data)
|
|
|
|
return cb;
|
|
|
|
|
|
|
|
/* Not currently on the callback list */
|
|
|
|
BUG_ON(size < sizeof(*cb));
|
|
|
|
cb = kzalloc(size, GFP_ATOMIC);
|
|
|
|
if (cb) {
|
|
|
|
cb->data = data;
|
|
|
|
cb->callback = unplug;
|
|
|
|
list_add(&cb->list, &plug->cb_list);
|
|
|
|
}
|
|
|
|
return cb;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_check_plugged);
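blk_check_plugged() above is a find-or-allocate lookup keyed on the (callback, data) pair; the size argument lets a caller embed struct blk_plug_cb at the start of a larger private structure. A hedged userspace sketch of the same idea follows, with calloc() standing in for kzalloc() and a NULL return standing in for BUG_ON(). All names (struct plug_cb, check_plugged, struct my_batch, my_unplug) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct plug_cb;
typedef void (*plug_cb_fn)(struct plug_cb *cb);

/* Hypothetical stand-in for struct blk_plug_cb. */
struct plug_cb {
	struct plug_cb *next;
	plug_cb_fn callback;
	void *data;
};

/* Return the entry matching (fn, data), allocating one if none exists. */
static struct plug_cb *check_plugged(struct plug_cb **head, plug_cb_fn fn,
				     void *data, size_t size)
{
	struct plug_cb *cb;

	for (cb = *head; cb; cb = cb->next)
		if (cb->callback == fn && cb->data == data)
			return cb;

	/* The caller's structure must be able to hold the embedded header. */
	if (size < sizeof(*cb))
		return NULL;

	cb = calloc(1, size);	/* zeroed, like kzalloc() */
	if (cb) {
		cb->callback = fn;
		cb->data = data;
		cb->next = *head;
		*head = cb;
	}
	return cb;
}

/* A caller-private structure; the header must be the first member. */
struct my_batch {
	struct plug_cb cb;
	int queued;
};

static void my_unplug(struct plug_cb *cb)
{
	free(cb);
}

/* Queue two items against the same key; both land in one batch. */
int demo_check_plugged(void)
{
	struct plug_cb *head = NULL;
	int dev = 0;	/* any stable address works as the lookup key */
	struct my_batch *first, *second;
	int queued;

	first = (struct my_batch *)check_plugged(&head, my_unplug, &dev,
						 sizeof(*first));
	if (!first)
		return -1;
	first->queued++;

	second = (struct my_batch *)check_plugged(&head, my_unplug, &dev,
						  sizeof(*second));
	second->queued++;

	queued = (first == second) ? second->queued : -1;
	head->callback(head);	/* "unplug": the callback frees the entry */
	return queued;
}
```

The second lookup returns the entry created by the first, so the counter reaches 2 on a single allocation. That reuse is what lets a caller accumulate work per plug instead of per request.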
|
|
|
|
|
2011-04-16 19:51:05 +08:00
|
|
|
void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
|
2011-03-08 20:19:51 +08:00
|
|
|
{
|
2012-07-31 15:08:15 +08:00
|
|
|
flush_plug_callbacks(plug, from_schedule);
|
2013-10-24 16:20:05 +08:00
|
|
|
|
|
|
|
if (!list_empty(&plug->mq_list))
|
|
|
|
blk_mq_flush_plug_list(plug, from_schedule);
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
|
|
|
|
2019-01-09 05:57:34 +08:00
|
|
|
/**
|
|
|
|
* blk_finish_plug - mark the end of a batch of submitted I/O
|
|
|
|
* @plug: The &struct blk_plug passed to blk_start_plug()
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Indicate that a batch of I/O submissions is complete. This function
|
|
|
|
* must be paired with an initial call to blk_start_plug(). The intent
|
|
|
|
* is to allow the block layer to optimize I/O submission. See the
|
|
|
|
* documentation for blk_start_plug() for more information.
|
|
|
|
*/
|
2011-03-08 20:19:51 +08:00
|
|
|
void blk_finish_plug(struct blk_plug *plug)
|
|
|
|
{
|
2015-05-09 01:51:28 +08:00
|
|
|
if (plug != current->plug)
|
|
|
|
return;
|
2011-04-15 21:49:07 +08:00
|
|
|
blk_flush_plug_list(plug, false);
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2015-05-09 01:51:28 +08:00
|
|
|
current->plug = NULL;
|
2011-03-08 20:19:51 +08:00
|
|
|
}
|
2011-04-15 21:20:10 +08:00
|
|
|
EXPORT_SYMBOL(blk_finish_plug);
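The check at the top of blk_finish_plug() above means a finish call only flushes when it is handed the plug that is currently installed for the task. The sketch below models that pairing in userspace. It assumes that the start side leaves an already installed plug in place (that part of blk_start_plug() is not shown in this excerpt), and every name in it (struct plug, current_plug, start_plug, submit, finish_plug) is illustrative rather than kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct blk_plug: just counts queued requests. */
struct plug {
	int queued;
};

static struct plug *current_plug;	/* models current->plug */
static int submitted;			/* requests handed to the "device" */

static void start_plug(struct plug *plug)
{
	if (current_plug)	/* assumed: a nested plug is not installed */
		return;
	plug->queued = 0;
	current_plug = plug;
}

static void submit(void)
{
	if (current_plug)
		current_plug->queued++;	/* batched until the plug is finished */
	else
		submitted++;		/* no plug: goes out immediately */
}

static void finish_plug(struct plug *plug)
{
	if (plug != current_plug)	/* not the active plug: nothing to do */
		return;
	submitted += plug->queued;	/* flush the whole batch at once */
	plug->queued = 0;
	current_plug = NULL;
}

/* Nested start/finish pairs: only the outermost finish flushes. */
int demo_nested_plug(void)
{
	struct plug outer, inner;

	submitted = 0;
	start_plug(&outer);
	submit();
	start_plug(&inner);	/* ignored, outer stays active */
	submit();
	finish_plug(&inner);	/* no-op: inner was never installed */
	assert(submitted == 0);	/* nothing has reached the device yet */
	finish_plug(&outer);	/* both requests are flushed here */
	return submitted;
}
```

Under these assumptions an inner start/finish pair is harmless: the batch keeps growing until the outermost finish call.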
|
2011-03-08 20:19:51 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
int __init blk_dev_init(void)
|
|
|
|
{
|
2016-10-28 22:48:16 +08:00
|
|
|
BUILD_BUG_ON(REQ_OP_LAST >= (1 << REQ_OP_BITS));
|
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2019-12-10 02:31:43 +08:00
|
|
|
sizeof_field(struct request, cmd_flags));
|
2016-10-28 22:48:16 +08:00
|
|
|
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
|
2019-12-10 02:31:43 +08:00
|
|
|
sizeof_field(struct bio, bi_opf));
|
2009-04-27 20:53:54 +08:00
|
|
|
|
2011-01-03 22:01:47 +08:00
|
|
|
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
|
|
|
|
kblockd_workqueue = alloc_workqueue("kblockd",
|
2014-06-12 05:43:54 +08:00
|
|
|
WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (!kblockd_workqueue)
|
|
|
|
panic("Failed to create kblockd\n");
|
|
|
|
|
2015-11-21 05:16:46 +08:00
|
|
|
blk_requestq_cachep = kmem_cache_create("request_queue",
|
2007-07-24 15:28:11 +08:00
|
|
|
sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-02-01 06:53:20 +08:00
|
|
|
#ifdef CONFIG_DEBUG_FS
|
|
|
|
blk_debugfs_root = debugfs_create_dir("block", NULL);
|
|
|
|
#endif
|
|
|
|
|
2008-01-24 15:53:35 +08:00
|
|
|
return 0;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
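The BUILD_BUG_ON() lines at the top of blk_dev_init() above are compile-time layout checks: the opcode count must fit in the opcode bits, and opcode plus flag bits must fit in the field that stores them. A rough userspace equivalent using C11 _Static_assert is sketched below. The DEMO_* constants, struct demo_request and demo_pack() are assumptions chosen for illustration; the real values and the real bit layout are defined in the kernel headers.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative values only; the real ones live in the kernel headers. */
#define DEMO_OP_BITS	8
#define DEMO_FLAG_BITS	24
enum demo_op { DEMO_OP_READ, DEMO_OP_WRITE, DEMO_OP_LAST };

struct demo_request {
	unsigned int cmd_flags;
};

/* Size of a struct member without needing an instance. */
#define demo_sizeof_field(type, member)	sizeof(((type *)0)->member)

/* Every opcode must be representable in the opcode bits. */
_Static_assert(DEMO_OP_LAST < (1 << DEMO_OP_BITS),
	       "opcodes do not fit in DEMO_OP_BITS");

/* Opcode and flag bits together must fit in the field that stores them. */
_Static_assert(DEMO_OP_BITS + DEMO_FLAG_BITS <=
	       CHAR_BIT * demo_sizeof_field(struct demo_request, cmd_flags),
	       "cmd_flags is too narrow");

/* Pack an opcode (low bits) and flags (high bits) into one word. */
unsigned int demo_pack(enum demo_op op, unsigned int flags)
{
	return (flags << DEMO_OP_BITS) | (unsigned int)op;
}
```

Because the checks run at build time, widening the opcode space or shrinking the field breaks compilation instead of silently corrupting the packed value at run time.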
|