end_cmd finishes request associated with nullb_cmd struct, so we
should save pointer to request_queue in a local variable before
calling end_cmd.
The problem was causes general protection fault with slab poisoning
enabled.
Fixes: 8b70f45e2e ("null_blk: restart request processing on completion handler")
Tested-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Main excitement here is Peter Zijlstra's lockless rbtree optimization to
speed module address lookup. He found some abusers of the module lock
doing that too.
A little bit of parameter work here too; including Dan Streetman's breaking
up the big param mutex so writing a parameter can load another module (yeah,
really). Unfortunately that broke the usual suspects, !CONFIG_MODULES and
!CONFIG_SYSFS, so those fixes were appended too.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVkgKHAAoJENkgDmzRrbjxQpwQAJVmBN6jF3SnwbQXv9vRixjH
58V33sb1G1RW+kXxQ3/e8jLX/4VaN479CufruXQp+IJWXsN/CH0lbC3k8m7u50d7
b1Zeqd/Yrh79rkc11b0X1698uGCSMlzz+V54Z0QOTEEX+nSu2ZZvccFS4UaHkn3z
rqDo00lb7rxQz8U25qro2OZrG6D3ub2q20TkWUB8EO4AOHkPn8KWP2r429Axrr0K
wlDWDTTt8/IsvPbuPf3T15RAhq1avkMXWn9nDXDjyWbpLfTn8NFnWmtesgY7Jl4t
GjbXC5WYekX3w2ZDB9KaT/DAMQ1a7RbMXNSz4RX4VbzDl+yYeSLmIh2G9fZb1PbB
PsIxrOgy4BquOWsJPm+zeFPSC3q9Cfu219L4AmxSjiZxC3dlosg5rIB892Mjoyv4
qxmg6oiqtc4Jxv+Gl9lRFVOqyHZrTC5IJ+xgfv1EyP6kKMUKLlDZtxZAuQxpUyxR
HZLq220RYnYSvkWauikq4M8fqFM8bdt6hLJnv7bVqllseROk9stCvjSiE3A9szH5
OgtOfYV5GhOeb8pCZqJKlGDw+RoJ21jtNCgOr6DgkNKV9CX/kL/Puwv8gnA0B0eh
dxCeB7f/gcLl7Cg3Z3gVVcGlgak6JWrLf5ITAJhBZ8Lv+AtL2DKmwEWS/iIMRmek
tLdh/a9GiCitqS0bT7GE
=tWPQ
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Main excitement here is Peter Zijlstra's lockless rbtree optimization
to speed module address lookup. He found some abusers of the module
lock doing that too.
A little bit of parameter work here too; including Dan Streetman's
breaking up the big param mutex so writing a parameter can load
another module (yeah, really). Unfortunately that broke the usual
suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
appended too"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
modules: only use mod->param_lock if CONFIG_MODULES
param: fix module param locks when !CONFIG_SYSFS.
rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
module: add per-module param_lock
module: make perm const
params: suppress unused variable error, warn once just in case code changes.
modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
kernel/module.c: avoid ifdefs for sig_enforce declaration
kernel/workqueue.c: remove ifdefs over wq_power_efficient
kernel/params.c: export param_ops_bool_enable_only
kernel/params.c: generalize bool_enable_only
kernel/module.c: use generic module param operaters for sig_enforce
kernel/params: constify struct kernel_param_ops uses
sysfs: tightened sysfs permission checks
module: Rework module_addr_{min,max}
module: Use __module_address() for module_address_lookup()
module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
module: Optimize __module_address() using a latched RB-tree
rbtree: Implement generic latch_tree
seqlock: Introduce raw_read_seqcount_latch()
...
When irqmode=2 (IRQ completion handler is timer) and queue_mode=1
(Block interface to use is rq), the completion handler should restart
request handling for any pending requests on a queue because request
processing stops when the number of commands are queued more than
hw_queue_depth (null_rq_prep_fn returns BLKPREP_DEFER).
Without this change, the following command cannot finish.
# modprobe null_blk irqmode=2 queue_mode=1 hw_queue_depth=1
# fio --name=t --rw=read --size=1g --direct=1 \
--ioengine=libaio --iodepth=64 --filename=/dev/nullb0
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When irqmode=2 (IRQ completion handler is timer), timer handler should
be called on the same CPU where the timer has been started.
Since completion_queues are per-cpu and the completion handler only
touches completion_queue for local CPU, we need to prevent the handler
from running on a different CPU where the timer has been started.
Otherwise, the IO cannot be completed until another completion handler
is executed on that CPU.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most code already uses consts for the struct kernel_param_ops,
sweep the kernel for the last offending stragglers. Other than
include/linux/moduleparam.h and kernel/params.c all other changes
were generated with the following Coccinelle SmPL patch. Merge
conflicts between trees can be handled with Coccinelle.
In the future git could get Coccinelle merge support to deal with
patch --> fail --> grammar --> Coccinelle --> new patch conflicts
automatically for us on patches where the grammar is available and
the patch is of high confidence. Consider this a feature request.
Test compiled on x86_64 against:
* allnoconfig
* allmodconfig
* allyesconfig
@ const_found @
identifier ops;
@@
const struct kernel_param_ops ops = {
};
@ const_not_found depends on !const_found @
identifier ops;
@@
-struct kernel_param_ops ops = {
+const struct kernel_param_ops ops = {
};
Generated-by: Coccinelle SmPL
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Junio C Hamano <gitster@pobox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: cocci@systeme.lip6.fr
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
null_blk is partitionable, but it doesn't store any of the info. When
it is loaded, you would normally see:
[1226739.343608] nullb0: unknown partition table
[1226739.343746] nullb1: unknown partition table
which can confuse some people. Add the appropriate gendisk flag
to suppress this info.
Signed-off-by: Jens Axboe <axboe@fb.com>
Check IS_ERR_OR_NULL(return value) instead of just return value.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reduced to IS_ERR() by me, we never return NULL.
Signed-off-by: Jens Axboe <axboe@fb.com>
When either queue_mode or irq_mode parameter is set outside its
boundaries, the driver will not complete requests. This stalls driver
initialization when partitions are probed. Fix by setting out of bound
values to the parameters default.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Updated by me to have the parse+check in just one function.
Signed-off-by: Jens Axboe <axboe@fb.com>
Since we have the notion of a 'last' request in a chain, we can use
this to have the hardware optimize the issuing of requests. Add
a list_head parameter to queue_rq that the driver can use to
temporarily store hw commands for issue when 'last' is true. If we
are doing a chain of requests, pass in a NULL list for the first
request to force issue of that immediately, then batch the remainder
for deferred issue until the last request has been sent.
Instead of adding yet another argument to the hot ->queue_rq path,
encapsulate the passed arguments in a blk_mq_queue_data structure.
This is passed as a constant, and has been tested as faster than
passing 4 (or even 3) args through ->queue_rq. Update drivers for
the new ->queue_rq() prototype. There are no functional changes
in this patch for drivers - if they don't use the passed in list,
then they will just queue requests individually like before.
Signed-off-by: Jens Axboe <axboe@fb.com>
When creation of queues fails in init_driver_queues(), we free the
queues. But null_add_dev() doesn't test for this failure and continues
with the setup leading to strange consequences, likely oops. Fix the
problem by testing whether init_driver_queues() failed and do proper
error cleanup.
Coverity-id: 1148005
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer driver update from Jens Axboe:
"This is the block driver pull request for 3.18. Not a lot in there
this round, and nothing earth shattering.
- A round of drbd fixes from the linbit team, and an improvement in
asender performance.
- Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and
hd from Michael Opdenacker.
- Disable entropy collection from flash devices by default, from Mike
Snitzer.
- A small collection of xen blkfront/back fixes from Roger Pau Monné
and Vitaly Kuznetsov"
* 'for-3.18/drivers' of git://git.kernel.dk/linux-block:
block: disable entropy contributions for nonrot devices
xen, blkfront: factor out flush-related checks from do_blkif_request()
xen-blkback: fix leak on grant map error path
xen/blkback: unmap all persistent grants when frontend gets disconnected
rsxx: Remove deprecated IRQF_DISABLED
block: hd: remove deprecated IRQF_DISABLED
drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks
drbd: compute the end before rb_insert_augmented()
drbd: Add missing newline in resync progress display in /proc/drbd
drbd: reduce lock contention in drbd_worker
drbd: Improve asender performance
drbd: Get rid of the WORK_PENDING macro
drbd: Get rid of the __no_warn and __cond_lock macros
drbd: Avoid inconsistent locking warning
drbd: Remove superfluous newline from "resync_extents" debugfs entry.
drbd: Use consistent names for all the bi_end_io callbacks
drbd: Use better variable names
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.
Historically, all block devices have automatically made entropy
contributions. But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
- On SSD disks, the completion times aren't as random as they
are for rotational drives. So it's questionable whether they
should contribute to the random pool in the first place.
- Calling add_disk_randomness() has a lot of overhead.
There are more reliable sources for randomness than non-rotational block
devices. From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we've changed the driver API on the submission side use the
opportunity to fix up the name on the completion side to fit into the
general scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.
Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pass an explicit parameter for the last request in a batch to ->queue_rq
instead of using a request flag. Besides being a cleaner and non-stateful
interface this is also required for the next patch, which fixes the blk-mq
I/O submission code to not start a time too early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Two of the blk-mq based drivers do not pass back the return value
from blk_mq_alloc_tag_set, instead just returning -ENOMEM.
blk_mq_alloc_tag_set returns -EINVAL if the number of queues or
queue depth is bad. -ENOMEM implies that retrying after freeing some
memory might be more successful, but that won't ever change
in the -EINVAL cases.
Change the null_blk and mtip32xx drivers to pass along
the return value.
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Only blk-mq completions have payload attached to the request, for
request_fn mode we have stored it in req->special. This fixes an
oops with queue_mode=1 and softirq completions.
Signed-off-by: Jens Axboe <axboe@fb.com>
'use_mq' is not the name of the module parameter, 'queue_mode' is.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull in core changes (again), since we got rid of the alloc/free
hctx mq_ops hooks and mtip32xx then needed updating again.
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the
core changes so we have a properly merged end result.
Signed-off-by: Jens Axboe <axboe@fb.com>
Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.
So just pass in the suggested node directly to the alloc
function.
Signed-off-by: Jens Axboe <axboe@fb.com>
entry(cmd->ll_list) may belong to new request once end_cmd()
returns, so fix the bug with the patch.
Without the change, it is easy to observe oops when
doing null_blk(timer) test.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a new blk_mq_tag_set structure that gets set up before we initialize
the queue. A single blk_mq_tag_set structure can be shared by multiple
queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Modular export of blk_mq_{alloc,free}_tagset added by me.
Signed-off-by: Jens Axboe <axboe@fb.com>
Drivers can reach their private data easily using the blk_mq_rq_to_pdu
helper and don't need req->special. By not initializing it code can
be simplified nicely, and we also shave off a few more instructions from
the I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the block layer helpers for CPU-local completions instead of
reimplementing them locally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The completion queue is implemented using lockless list.
The llist_add is adds the events to the list head which is a push operation.
The processing of the completion elements is done by disconnecting all the
pushed elements and iterating over the disconnected list. The problem is
that the processing is done in reverse order w.r.t order of the insertion
i.e. LIFO processing. By reversing the disconnected list which is done in
linear time the desired FIFO processing is achieved.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block IO driver changes from Jens Axboe:
- bcache update from Kent Overstreet.
- two bcache fixes from Nicholas Swenson.
- cciss pci init error fix from Andrew.
- underflow fix in the parallel IDE pg_write code from Dan Carpenter.
I'm sure the 1 (or 0) users of that are now happy.
- two PCI related fixes for sx8 from Jingoo Han.
- floppy init fix for first block read from Jiri Kosina.
- pktcdvd error return miss fix from Julia Lawall.
- removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
Michael Opdenacker.
- comment typo fix for the loop driver from Olaf Hering.
- potential oops fix for null_blk from Raghavendra K T.
- two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
an OOM problem and a problem with handling security locked conditions
* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
mg_disk: Spelling s/finised/finished/
null_blk: Null pointer deference problem in alloc_page_buffers
mtip32xx: Correctly handle security locked condition
mtip32xx: Make SGL container per-command to eliminate high order dma allocation
drivers/block/loop.c: fix comment typo in loop_config_discard
drivers/block/cciss.c:cciss_init_one(): use proper errnos
drivers/block/paride/pg.c: underflow bug in pg_write()
drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
drivers/block/sx8.c: use module_pci_driver()
floppy: bail out in open() if drive is not responding to block0 read
bcache: Fix auxiliary search trees for key size > cacheline size
bcache: Don't return -EINTR when insert finished
bcache: Improve bucket_prio() calculation
bcache: Add bch_bkey_equal_header()
bcache: update bch_bkey_try_merge
bcache: Move insert_fixup() to btree_keys_ops
bcache: Convert sorting to btree_keys
bcache: Convert debug code to btree_keys
bcache: Convert btree_iter to struct btree_keys
bcache: Refactor bset_tree sysfs stats
...
If we load the null_blk module with bs=8k we get following oops:
[ 3819.812190] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 3819.812387] IP: [<ffffffff81170aa5>] create_empty_buffers+0x28/0xaf
[ 3819.812527] PGD 219244067 PUD 215a06067 PMD 0
[ 3819.812640] Oops: 0000 [#1] SMP
[ 3819.812772] Modules linked in: null_blk(+)
Fix that by resetting block size to PAGE_SIZE if it is greater than PAGE_SIZE
Reported-by: Sumanth <sumantk2@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When queue_mode is NULL_Q_MQ and null_blk is being removed,
blk_cleanup_queue() isn't called to cleanup queue, so the queue
allocated won't be freed.
This patch calls blk_cleanup_queue() for MQ to drain all pending
requests first and release the reference counter of queue kobject, then
blk_mq_free_queue() will be called in queue kobject's release handler
when queue kobject's reference counter drops to zero.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block fixes from Jens Axboe:
- fix for a memory leak on certain unplug events
- a collection of bcache fixes from Kent and Nicolas
- a few null_blk fixes and updates form Matias
- a marking of static of functions in the stec pci-e driver
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: support submit_queues on use_per_node_hctx
null_blk: set use_per_node_hctx param to false
null_blk: corrections to documentation
null_blk: warning on ignored submit_queues param
null_blk: refactor init and init errors code paths
null_blk: documentation
null_blk: mem garbage on NUMA systems during init
drivers: block: Mark the functions as static in skd_main.c
bcache: New writeback PD controller
bcache: bugfix for race between moving_gc and bucket_invalidate
bcache: fix for gc and writeback race
bcache: bugfix - moving_gc now moves only correct buckets
bcache: fix for gc crashing when no sectors are used
bcache: Fix heap_peek() macro
bcache: Fix for can_attach_cache()
bcache: Fix dirty_data accounting
bcache: Use uninterruptible sleep in writeback
bcache: kthread don't set writeback task to INTERUPTIBLE
block: fix memory leaks on unplugging block device
bcache: fix sparse non static symbol warning
In the case of both the submit_queues param and use_per_node_hctx param
are used. We limit the number af submit_queues to the number of online
nodes.
If the submit_queues is a multiple of nr_online_nodes, its trivial. Simply map
them to the nodes. For example: 8 submit queues are mapped as node0[0,1],
node1[2,3], ...
If uneven, we are left with an uneven number of submit_queues that must be
mapped. These are mapped toward the first node and onward. E.g. 5
submit queues mapped onto 4 nodes are mapped as node0[0,1], node1[2], ...
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The defaults for the module is to instantiate itself with blk-mq and a
submit queue for each CPU node in the system.
To save resources, initialize instead with a single submit queue.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let the user know when the number of submission queues are being
ignored.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify the initialization logic of the three block-layers.
- The queue initialization is split into two parts. This allows reuse of
code when initializing the sq-, bio- and mq-based layers.
- Set submit_queues default value to 0 and always set it at init time.
- Simplify the init error code paths.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For NUMA systems, initializing the blk-mq layer and using per node hctx.
We initialize submit queues to 1, while blk-mq nr_hw_queues is
initialized to the number of NUMA nodes.
This makes the null_init_hctx function overwrite memory outside of what
it allocated. In my case it lead to writing garbage into struct
request_queue's mq_map.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For NUMA systems, initializing the blk-mq layer and using per node hctx.
We initialize submit queues to 1, while blk-mq nr_hw_queues is
initialized to the number of NUMA nodes.
This makes the null_init_hctx function overwrite memory outside of what
it allocated. In my case it lead to writing garbage into struct
request_queue's mq_map.
Signed-off-by: Matias Bjorling <m@bjorling.me>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove CONFIG_USE_GENERIC_SMP_HELPERS left by commit 0a06ff068f
("kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS").
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A driver that simply completes IO it receives, it does no
transfers. Written to fascilitate testing of the blk-mq code.
It supports various module options to use either bio queueing,
rq queueing, or mq mode.
Signed-off-by: Jens Axboe <axboe@kernel.dk>