Convert dm to embedded bio sets.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJadzbSAAoJEPfTWPspceCmt5QP/jo6MSsNVevAQOE75Jje+qa/
aF/BjHBdUmmI5WtPrtoz4igaJou7M2U0s8jdsc3c7uMw8dGTKc6ujIquSEn0wevY
faJPTjWzLum3y50gwRHcrHCQIlxOe5/f9rJevW4+q76aMP3aWKjO4bgBExH+2XnA
CaT+6d40skYt20Sy428H0yhVdDAMiQYXTeg4SssWQY9AvJSSiW7ax+vmP3r5BKpV
dXHggwgzqDuMwLZG80Tfg4GHGv5qisIrqLOCxtXNYHDNb/aDmbTFTO2jPgobT8gW
N2kWxsOkBayUdPw6Nt2Wlm4toQgR5GJGH04LH2vI5p4dp4Grvx/aFGvUbT7+sN1u
g/mmqsUUnYuO5AJ8XY2s2F7ezaT6v9x8BbLHuA2vz0r5GsdFVXctZ/bXgQqkmh9i
KLtfyOPldlczclVEuKL4xai1aXLcoBzDwyLxzbFp3+eAlhcgoSqxnMsE4fCJblCU
dfShDChu1SbBD6dyGx8sol9cT48RFj2tBtpfcYxFW/NJJOQoh9FTqPQetYQxQ72c
TadEf40hmw5Q2l0Hu5pwVbKHWUP0wn0VznkAOfT4VV1ysk93oExMbjgS2qh16xEZ
oQwFDQMk3D8BXI9VwH8gUUnypkhcooMhznxSC3BQxjGn/R+byp7QEPvxSEZz/4nD
BaBSbyAU5cpof+Eaqs4B
=qeDb
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180204' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"Most of this is fixes and not new code/features:
- skd fix from Arnd, fixing a build error dependent on sla allocator
type.
- blk-mq scheduler discard merging fixes, one from me and one from
Keith. This fixes a segment miscalculation for blk-mq-sched, where
we mistakenly think two segments are physically contigious even
though the request isn't carrying real data. Also fixes a bio-to-rq
merge case.
- Don't re-set a bit on the buffer_head flags, if it's already set.
This can cause scalability concerns on bigger machines and
workloads. From Kemi Wang.
- Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to
distuingish between a local (device related) resource starvation
and a global one. The latter might happen without IO being in
flight, so it has to be handled a bit differently. From Ming"
* tag 'for-linus-20180204' of git://git.kernel.dk/linux-block:
block: skd: fix incorrect linux/slab_def.h inclusion
buffer: Avoid setting buffer bits that are already set
blk-mq-sched: Enable merging discard bio into request
blk-mq: fix discard merge with scheduler attached
blk-mq: introduce BLK_STS_DEV_RESOURCE
walk; this is critical to allow forward progress without the need to
use the bioset's BIOSET_NEED_RESCUER.
- Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.
- DM core cleanups and improvements to make bio-based DM more efficient
(e.g. reduced memory footprint as well leveraging per-bio-data more).
- Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
the more direct IO submission path in the block layer; this mode is
used by DM multipath and also optimizes targets like DM thin-pool that
stack directly on NVMe data device.
- DM multipath improvements to factor out legacy SCSI-only
(e.g. scsi_dh) code paths to allow for more optimized support for NVMe
multipath.
- A fix for DM multipath path selectors (service-time and queue-length)
to select paths in a more balanced way; largely academic but doesn't
hurt.
- Numerous DM raid target fixes and improvements.
- Add a new DM "unstriped" target that enables Intel to workaround
firmware limitations in some NVMe drives that are striped internally
(this target also works when stacked above the DM "striped" target).
- Various Documentation fixes and improvements.
- Misc. cleanups and fixes across various DM infrastructure and targets
(e.g. bufio, flakey, log-writes, snapshot).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJacgwPAAoJEMUj8QotnQNaEw0H/0XRTcg8/lRuGl46kdeI3PgR
ZxUy4XgUrCLiACWO5yCU/nKipB32+3xTlTDTBcjmaBfX8HolH147Pasb1KdHqLVC
dOWLMpjlFztb5fnuOMitJA05qQAbgRlZ52QdVk/FDo9yWicgWjQZduh8aYX53pHw
6XOYWzSFAXQcaduPdz6TLiPw479xBwIpXxQbrO09f4qt3Ub4bqknEhzFXc+6M7zl
ejmW/bG2Qg6WmsfAuaAhFTV0LpTPSEzvaq9TfR7yqFU3DvDIAi7Yh8eQinIUDo4u
txpOGoESRAMPAMKH0/UJdr/u7jTsfgJox4QEavWfnViPvkouah5KdjVOL1veZ5U=
=R3dN
-----END PGP SIGNATURE-----
Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core fixes to ensure that bio submission follows a depth-first
tree walk; this is critical to allow forward progress without the
need to use the bioset's BIOSET_NEED_RESCUER.
- Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.
- DM core cleanups and improvements to make bio-based DM more efficient
(e.g. reduced memory footprint as well leveraging per-bio-data more).
- Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
the more direct IO submission path in the block layer; this mode is
used by DM multipath and also optimizes targets like DM thin-pool
that stack directly on NVMe data device.
- DM multipath improvements to factor out legacy SCSI-only (e.g.
scsi_dh) code paths to allow for more optimized support for NVMe
multipath.
- A fix for DM multipath path selectors (service-time and queue-length)
to select paths in a more balanced way; largely academic but doesn't
hurt.
- Numerous DM raid target fixes and improvements.
- Add a new DM "unstriped" target that enables Intel to workaround
firmware limitations in some NVMe drives that are striped internally
(this target also works when stacked above the DM "striped" target).
- Various Documentation fixes and improvements.
- Misc cleanups and fixes across various DM infrastructure and targets
(e.g. bufio, flakey, log-writes, snapshot).
* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
dm cache: Documentation: update default migration_throttling value
dm mpath selector: more evenly distribute ties
dm unstripe: fix target length versus number of stripes size check
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
dm table: fix NVMe bio-based dm_table_determine_type() validation
dm: various cleanups to md->queue initialization code
dm mpath: delay the retry of a request if the target responded as busy
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
dm log writes: fix max length used for kstrndup
dm: backfill missing calls to mutex_destroy()
dm snapshot: use mutex instead of rw_semaphore
dm flakey: check for null arg_name in parse_features()
dm thin: extend thinpool status format string with omitted fields
dm thin: fixes in thin-provisioning.txt
dm thin: document representation of <highest mapped sector> when there is none
dm thin: fix documentation relative to low water mark threshold
dm cache: be consistent in specifying sectors and SI units in cache.txt
dm cache: delete obsoleted paragraph in cache.txt
dm cache: fix grammar in cache-policies.txt
...
This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.
Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.
If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:
1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();
2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
completed via blk_mq_sched_restart()
3) if SCHED_RESTART is set concurently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.
One invariant is that queue will be rerun if SCHED_RESTART is set.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's
multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue.
This is beneficial to do if BLK_STS_RESOURCE is returned from the target
(because target is busy).
Relative to blk-mq: kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay.
For old .request_fn: use blk_delay_queue().
bio-based multipath doesn't have feature parity with request-based for
retryable error requeues; that is something that'll need fixing in the
future.
Suggested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Bart Van Assche <bart.vanassche@wdc.com>
[as interpreted from Bart's "... patch looks fine to me."]
blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.
1) This way isn't efficient enough because the hctx spinlock is always
used.
2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO. But dm-rq currently can't get the underlying queue's
dispatch feedback at all. Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging). This
obviously causes very bad sequential IO performance.
Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request. Whereby preserving the opportunity to
merge IO.
With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DM is no longer prone to having its request_queue be improperly
initialized.
Summary of changes:
- defer DM's blk_register_queue() from add_disk()-time until
dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
- dm_setup_md_queue() is updated to fully initialize DM's request_queue
(_after_ all table loads have occurred and the request_queue's type,
features and limits are known).
A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
blk_register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "iosched" exposed in syfs.
In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before dm_setup_md_queue() calls blk_register_queue().
These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/
In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure provided by
the request_queue).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to update the original dm request partially when ending
each cloned bio: just update original dm request once when the whole
cloned request is finished. This still allows full support for partial
completion because a new 'completed' counter accounts for incremental
progress as the clone bios complete.
Partial request update can be a bit expensive, so we should try to avoid
it, especially because it is run in softirq context.
Avoiding all the partial request updates fixes both hard lockup and
soft lockups that were easily reproduced while running Laurence's
test[1] on IB/SRP.
BTW, after d4acf3650c ("block: Make blk_mq_delay_kick_requeue_list()
rerun the queue at a quiet time"), we need to make the test more
aggressive for reproducing the lockup:
1) run hammer_write.sh 32 or 64 concurrently.
2) write 8M each time
[1] https://marc.info/?l=linux-block&m=150220185510245&w=2
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
DM_MAPIO_DELAY_REQUEUE causes dm-mq to requeue after a delay but
causes dm-sq to requeue immediately. Make the behavior of dm-sq
consistent with that of dm-mq.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
blk_mq_unquiesce_queue() is used for unquiescing the
queue explicitly, so replace blk_mq_start_stopped_hw_queues()
with it.
For the scsi part, this patch takes Bart's suggestion to
switch to block quiesce/unquiesce API completely.
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Cc: dm-devel@redhat.com
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the same values for use for request completion errors as the return
value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.
For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.
blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't want to bug when receiving a DM_MAPIO_KILL value..
Fixes: 412445ac ("dm: introduce a new DM_MAPIO_KILL return value")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull block fixes and updates from Jens Axboe:
"Some fixes and followup features/changes that should go in, in this
merge window. This contains:
- Two fixes for lightnvm from Javier, fixing problems in the new code
merge previously in this merge window.
- A fix from Jan for the backing device changes, fixing an issue in
NFS that causes a failure to mount on certain setups.
- A change from Christoph, cleaning up the blk-mq init and exit
request paths.
- Remove elevator_change(), which is now unused. From Bart.
- A fix for queue operation invocation on a dead queue, from Bart.
- A series fixing up mtip32xx for blk-mq scheduling, removing a
bandaid we previously had in place for this. From me.
- A regression fix for this series, fixing a case where we wait on
workqueue flushing from an invalid (non-blocking) context. From me.
- A fix/optimization from Ming, ensuring that we don't both quiesce
and freeze a queue at the same time.
- A fix from Peter on lock ordering for CPU hotplug. Not a real
problem right now, but will be once the CPU hotplug rework goes in.
- A series from Omar, cleaning up out blk-mq debugfs support, and
adding support for exporting info from schedulers in debugfs as
well. This is really useful in debugging stalls or livelocks. From
Omar"
* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
mq-deadline: add debugfs attributes
kyber: add debugfs attributes
blk-mq-debugfs: allow schedulers to register debugfs attributes
blk-mq: untangle debugfs and sysfs
blk-mq: move debugfs declarations to a separate header file
blk-mq: Do not invoke queue operations on a dead queue
blk-mq-debugfs: get rid of a bunch of boilerplate
blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
blk-mq-debugfs: don't open code strstrip()
blk-mq-debugfs: error on long write to queue "state" file
blk-mq-debugfs: clean up flag definitions
blk-mq-debugfs: separate flags with |
nfs: Fix bdi handling for cloned superblocks
block/mq: Cure cpu hotplug lock inversion
lightnvm: fix bad back free on error path
lightnvm: create cmd before allocating request
blk-mq: don't use sync workqueue flushing from drivers
mtip32xx: convert internal commands to regular block infrastructure
mtip32xx: cleanup internal tag assumptions
block: don't call blk_mq_quiesce_queue() after queue is frozen
...
Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq. Except for a superflous check in
mtip32xx it was unused anyway.
Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow on cleanup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This untangles the DM_MAPIO_* values returned from ->clone_and_map_rq
from the error codes used by the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Instead of returning either a DM_ENDIO_* constant or an error code, add
a new DM_ENDIO_DONE value that means keep errno as is. This allows us
to easily keep the existing error code in case where we can't push back,
and it also preparares for the new block level status codes with strict
type checking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Otherwise the request-based DM blk-mq request_queue will be put into
service without being properly exported via sysfs.
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If blk_get_request() returns ENODEV then multipath_clone_and_map()
causes a request to be requeued immediately. This can cause a kworker
thread to spend 100% of the CPU time of a single core in
__blk_mq_run_hw_queue() and also can cause device removal to never
finish.
Avoid this by only requeuing after a delay if blk_get_request() fails.
Additionally, reduce the requeue delay.
Cc: stable@vger.kernel.org # 4.9+
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Now that all drivers that call blk_mq_complete_requests have a
->complete callback we can remove the direct call to blk_mq_end_request,
as well as the error argument to blk_mq_complete_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
dm never uses rq->errors, so there is no need to pass an error argument
to blk_mq_complete_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <axboe@fb.com>
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:
echo run >/sys/kernel/debug/block/dm-0/state
This patch avoids that such request processing stalls occur. The
test I ran is as follows:
while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Jens Axboe <axboe@fb.com>
Constify all instances of blk_mq_ops, as they are never modified.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart reported a case where dm would crash with use-after-free
poison. This is due to dm_softirq_done() accessing memory
associated with a request after calling end_request on it.
This is most visible on !blk-mq, since we free the memory
immediately for that case.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: eb8db831be ("dm: always defer request allocation to the owner of the request_queue")
Signed-off-by: Jens Axboe <axboe@fb.com>
Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
being passed from dm_old_request_fn() that races with DM device
destruction.
Reported-by: artem@flashgrid.io
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
DM already calls blk_mq_alloc_request on the request_queue of the
underlying device if it is a blk-mq device. But now that we allow drivers
to allocate additional data and initialize it ahead of time we need to do
the same for all drivers. Doing so and using the new cmd_size
infrastructure in the block layer greatly simplifies the dm-rq and mpath
code, and should also make arbitrary combinations of SQ and MQ devices
with SQ or MQ device mapper tables easily possible as a further step.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
DM tries to copy a few fields around for BLOCK_PC requests, but given
that no dm-target ever wires up scsi_cmd_ioctl BLOCK_PC can't actually
be sent to dm.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Return an errno value instead of the passed in queue so that the callers
don't have to keep track of two queues, and move the assignment of the
request_fn and lock to the caller as passing them as argument doesn't
simplify anything. While we're at it also remove two pointless NULL
assignments, given that the request structure is zeroed on allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
. some locking improvements in DM bufio
. add Kconfig option to disable the DM block manager's extra locking
which mainly serves as a developer tool
. a few bug fixes to DM's persistent-data
. a couple changes to prepare for multipage biovec support in the block
layer
. various improvements and cleanups in the DM core, DM cache, DM raid
and DM crypt
. add ability to have DM crypt use keys from the kernel key retention
service
. add a new "error_writes" feature to the DM flakey target, reads are
left unchanged in this mode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJYUW8zAAoJEMUj8QotnQNaAWEIAMRQ4aCXq5T7F9Hf4K/l6FwO
FoBr2TPS3Lf0vm/A5Tr819I47hk7q0oroa61ARbpS90iuGt/Au/Sk35cn1BwT0YW
llMvMGbh+w9ZBUJGkyexdXbyfm5ywPHuthMr4CK/UNASyjDl2QMAeBuUZ6FLSPn1
RUL/RYv0mG/7EXOPz0PURPb5rpjO15cAU0NjfNS0862UVR8x8dNS6iImOmScsioe
Flw90qPl3kMBxBHik8xSPJfhtW+lD7xSaOlWzHKtalnUZHRG2BNUtlAMKdiaynx2
yl9MhSsi8wlgd4h9WmlmaOr0VqkU5UYY9D9TDuuJwXnHUXGenVSJ/aGOohr+bm4=
=kOoK
-----END PGP SIGNATURE-----
Merge tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- various fixes and improvements to request-based DM and DM multipath
- some locking improvements in DM bufio
- add Kconfig option to disable the DM block manager's extra locking
which mainly serves as a developer tool
- a few bug fixes to DM's persistent-data
- a couple changes to prepare for multipage biovec support in the block
layer
- various improvements and cleanups in the DM core, DM cache, DM raid
and DM crypt
- add ability to have DM crypt use keys from the kernel key retention
service
- add a new "error_writes" feature to the DM flakey target, reads are
left unchanged in this mode
* tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (40 commits)
dm flakey: introduce "error_writes" feature
dm cache policy smq: use hash_32() instead of hash_32_generic()
dm crypt: reject key strings containing whitespace chars
dm space map: always set ev if sm_ll_mutate() succeeds
dm space map metadata: skip useless memcpy in metadata_ll_init_index()
dm space map metadata: fix 'struct sm_metadata' leak on failed create
Documentation: dm raid: define data_offset status field
dm raid: fix discard support regression
dm raid: don't allow "write behind" with raid4/5/6
dm mpath: use hw_handler_params if attached hw_handler is same as requested
dm crypt: add ability to use keys from the kernel key retention service
dm array: remove a dead assignment in populate_ablock_with_values()
dm ioctl: use offsetof() instead of open-coding it
dm rq: simplify use_blk_mq initialization
dm: use blk_set_queue_dying() in __dm_destroy()
dm bufio: drop the lock when doing GFP_NOIO allocation
dm bufio: don't take the lock in dm_bufio_shrink_count
dm bufio: avoid sleeping while holding the dm_bufio lock
dm table: simplify dm_table_determine_type()
dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
Use a single statement to declare and initialize 'use_blk_mq' instead
of two statements.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Avoid accessing .bi_vcnt directly, because the bio can be split from
block layer and .bi_vcnt should never have been used here.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
It is required to hold the queue lock when calling blk_run_queue_async()
to avoid that a race between blk_run_queue_async() and
blk_cleanup_queue() is triggered.
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
calls have stopped before setting the "queue stopped" flag. This
allows to remove the "queue stopped" test from dm_mq_queue_rq() and
dm_mq_requeue_request(). This patch fixes a race condition because
dm_mq_queue_rq() is called without holding the queue lock and hence
BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is
in progress. This patch prevents that the following hang occurs
sporadically when using dm-mq:
INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
Call Trace:
[<ffffffff8161f397>] schedule+0x37/0x90
[<ffffffff816239ef>] schedule_timeout+0x27f/0x470
[<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110
[<ffffffff8161fb36>] bit_wait_io+0x16/0x60
[<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0
[<ffffffff8114fe69>] __lock_page+0xb9/0xc0
[<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760
[<ffffffff81166120>] truncate_inode_pages+0x10/0x20
[<ffffffff81212a20>] kill_bdev+0x30/0x40
[<ffffffff81213d41>] __blkdev_put+0x71/0x360
[<ffffffff81214079>] blkdev_put+0x49/0x170
[<ffffffff812141c0>] blkdev_close+0x20/0x30
[<ffffffff811d48e8>] __fput+0xe8/0x1f0
[<ffffffff811d4a29>] ____fput+0x9/0x10
[<ffffffff810842d3>] task_work_run+0x83/0xb0
[<ffffffff8106606e>] do_exit+0x3ee/0xc40
[<ffffffff8106694b>] do_group_exit+0x4b/0xc0
[<ffffffff81073d9a>] get_signal+0x2ca/0x940
[<ffffffff8101bf43>] do_signal+0x23/0x660
[<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0
[<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0
[<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
in the dm start and stop queue functions, only manipulate the latter
flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
if __dm_suspend() or __dm_resume() is called and not if the
requeue list is processed.
* In the SCSI core queue stopping and starting should only be
performed by the scsi_internal_device_block() and
scsi_internal_device_unblock() functions but not by any other
function. Although the blk_mq_stop_hw_queue() call in
scsi_queue_rq() may help to reduce CPU load if a LLD queue is
full, figuring out whether or not a queue should be restarted
when requeueing a command would require to introduce additional
locking in scsi_mq_requeue_cmd() to avoid a race with
scsi_internal_device_block(). Avoid this complexity by removing
the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
blk_mq_start_stopped_hw_queues() explicitly should start stopped
queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
xen-blkfront driver in its blkif_recover() function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
- A couple .request_fn request-based DM NULL pointer fixes
- A fix for a DM target reference count leak, on target load error, that
prevented associated DM target kernel module(s) from being removed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJYEo+lAAoJEMUj8QotnQNaGfkH/jGqr4bj4l2Ty3QgV95fYW7+
lqp4Flkevm35HotEGKuuizvqbbVrj57BCGLE+dV48/X2cv5QbUFht6QBu9iJTrk6
Q7VqyBOvDDnOZHIof5CfKBeLZ2gd8YHZwUpYvzJcThSWS1+LjeVqg8a33LMZroMQ
rghVxFCIKy6LqCryIiTHk1t+OfmuBz3S2LXcQXFY7XAPpWq/f+V66gthTZUpm86+
Gu1xOHQlvnmf5xnDUxCpPVbQNY334D/aSbU73i2cdvfL1pkxBFNcI+LbPcu+sNP9
ugGjPj4etbIRsVysuW3fLhn2kKqaXXVuD1rLTQ+C3ytciI+RQJvG892gWhAABRQ=
=apHk
-----END PGP SIGNATURE-----
Merge tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a couple DM raid and DM mirror fixes
- a couple .request_fn request-based DM NULL pointer fixes
- a fix for a DM target reference count leak, on target load error,
that prevented associated DM target kernel module(s) from being
removed
* tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: fix missing dm_put_target_type() in dm_table_add_target()
dm rq: clear kworker_task if kthread_run() returned an error
dm: free io_barrier after blk_cleanup_queue call
dm raid: fix activation of existing raid4/10 devices
dm mirror: use all available legs on multiple failures
dm mirror: fix read error on recovery after default leg failure
dm raid: fix compat_features validation
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
cleanup_mapped_device() calls kthread_stop() if kworker_task is
non-NULL. Currently the assigned value could be a valid task struct or
an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if
kthread_run() returned an erorr.
Fixes: 7193a9defc ("dm rq: check kthread_run return for .request_fn request-based DM")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>