Simplifies each DM target's init method by making dm_register_target()
responsible for its error reporting (on behalf of targets).
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Flushing system-wide workqueues is dangerous and will be forbidden.
Use a local workqueue in dm-mpath.c, dm-raid1.c, and dm-stripe.c.
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.
Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The precision loss of reading IO start_time with jiffies_to_nsecs
instead of using a high resolution timer degrades HST path prediction
for BIO-based mpath on high load workloads.
Below, I show the utilization percentage of a 10 disk multipath with
asymmetrical disk access cost, while being exercised by a randwrite FIO
benchmark with high submission queue depth (depth=64). It is possible
to see that the HST path selection degrades heavily for high-iops in
BIO-mpath, underutilizing the slower paths way beyond expected. This
seems to be caused by the start_time truncation, which makes some IO to
seem much slower than it actually is. In this scenario ST outperforms
HST for bio-mpath, but not for mq-mpath, which already uses ktime_get_ns().
The third column shows utilization with this patch applied. It is easy
to see that now HST prediction is much closer to the ideal distribution
(calculated considering the real cost of each path).
| | ST | HST (orig) | HST(ktime) | Best |
| sdd | 0.17 | 0.20 | 0.17 | 0.18 |
| sde | 0.17 | 0.20 | 0.17 | 0.18 |
| sdf | 0.17 | 0.20 | 0.17 | 0.18 |
| sdg | 0.06 | 0.00 | 0.06 | 0.04 |
| sdh | 0.03 | 0.00 | 0.03 | 0.02 |
| sdi | 0.03 | 0.00 | 0.03 | 0.02 |
| sdj | 0.02 | 0.00 | 0.01 | 0.01 |
| sdk | 0.02 | 0.00 | 0.01 | 0.01 |
| sdl | 0.17 | 0.20 | 0.17 | 0.18 |
| sdm | 0.17 | 0.20 | 0.17 | 0.18 |
This issue was originally discussed [1] when we first merged HST, and
this patch was left as a low hanging fruit to be solved later.
Regarding the implementation, as suggested by Mike in that mail thread,
in order to avoid the overhead of ktime_get_ns for other selectors, this
patch adds a flag for the selector code to request the high-resolution
timer.
I tested this using the same benchmark used in the original HST submission.
Full test and benchmark scripts are available here:
https://people.collabora.com/~krisman/HST-BIO-MPATH/
[1] https://lore.kernel.org/lkml/85tv0am9de.fsf@collabora.com/T/
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
[snitzer: cleaned up various implementation details]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes. The high blast radius core update is the removal of
write same, which affects block and several non-SCSI devices. The
other big change, which is more local, is the removal of the SCSI
pointer.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYjzDQyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQMYAQDEWUGV
6U0+736AHVtOfiMNfiRN79B1HfXVoHvemnPcTwD/UlndwFfy/3GGOtoZmqEpc73J
Ec1HDuUCE18H1H2QAh0=
=/Ty9
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes.
The high blast radius core update is the removal of write same, which
affects block and several non-SCSI devices. The other big change,
which is more local, is the removal of the SCSI pointer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
scsi: scsi_ioctl: Drop needless assignment in sg_io()
scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
scsi: lpfc: Copyright updates for 14.2.0.0 patches
scsi: lpfc: Update lpfc version to 14.2.0.0
scsi: lpfc: SLI path split: Refactor BSG paths
scsi: lpfc: SLI path split: Refactor Abort paths
scsi: lpfc: SLI path split: Refactor SCSI paths
scsi: lpfc: SLI path split: Refactor CT paths
scsi: lpfc: SLI path split: Refactor misc ELS paths
scsi: lpfc: SLI path split: Refactor VMID paths
scsi: lpfc: SLI path split: Refactor FDISC paths
scsi: lpfc: SLI path split: Refactor LS_RJT paths
scsi: lpfc: SLI path split: Refactor LS_ACC paths
scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
scsi: lpfc: SLI path split: Refactor lpfc_iocbq
scsi: lpfc: Use kcalloc()
...
Just use the %pg format specifier instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.
Link: https://lore.kernel.org/r/20220209082828.2629273-7-hch@lst.de
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Just use the disk attached to the request_queue instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MnsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuBpEACzrzbUfkTQ33bwF60mZQaqbR0ha7TrP/hp
oAqthmf1S2U+7mzXHQ+6MN7p4+TVPa/ITxQZtLTw7U/68+w68tTUZfZHJ5H6tSXu
92OHFDDP4ZeqATRTcJBij/5Si9BiKBHexMqeyVYPw0DWdEukAko9f7Z81GonFbTu
EIdIWivBc76bLiK/X3w7lhLcaNyUv9cKalwjbI4xtwcHtcIYj5d2jIc9PF2I9Xtl
3oqNT4GOSv7s3mW7syB1UEPrzbhVIzCSNbMSviCoK7GA5g8EN5KMEGQQoUJ942Zv
bHMjMpGrXsWebPto9maXycGY/9WsVcpNB7opyQRpyG8yDDZq0AFNJxD/NBMkQo4S
Sfp0fxpVXDRWu7zX0EktwGyOp4YNwfS6pDeAhqhnSl2uPWTsxGZ0kXvlMpR9Rt/t
TjEKZe6lmcC7s42rPVRBRw5HEzEsVovf0z4lyvC4M223CV3c5cuYkAAtCcqLdVWq
JkceHSb7EKu7QY6jf3sBud14HaAj+sub7kffOWhhAxObg3Ytsql61AGzbhandnxT
AtN3n9PBHNGmrSv4MiiuP+Dq5jeT5NspFkf1FvnRcfmZMJtH1VXHKr84JbAy4VHr
5cZoDJzL9Zm1d865f+VWkZeYd3b2kKP8C0dm6tAn4VweT6eb8bu6tgB7wFQwLIFK
aRxz5vQ1AQ==
=dLYJ
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block
Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe:
"This contains a series leading to the removal of the
QUEUE_FLAG_SCSI_PASSTHROUGH queue flag"
* tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block:
block: remove blk_{get,put}_request
block: remove QUEUE_FLAG_SCSI_PASSTHROUGH
block: remove the initialize_rq_fn blk_mq_ops method
scsi: add a scsi_alloc_request helper
bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn
nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
sd: implement ->get_unique_id
block: add a ->get_unique_id method
These are now pointless wrappers around blk_mq_{alloc,free}_request,
so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211025070517.1548584-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helpers to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Certain DM targets ('integrity', 'multipath', 'verity') need to update the
way their attributes are recorded in the ima log, so that the attestation
servers can interpret the data correctly and decide if the devices
meet the attestation requirements. For instance, the "mode=%c" attribute
in the 'integrity' target is measured twice, the 'verity' target is
missing the attribute "root_hash_sig_key_desc=%s", and the 'multipath'
target needs to index the attributes properly.
Update 'integrity' target to remove the duplicate measurement of
the attribute "mode=%c". Add "root_hash_sig_key_desc=%s" attribute
for the 'verity' target. Index various attributes in 'multipath'
target. Also, add "nr_priority_groups=%u" attribute to 'multipath'
target to record the number of priority groups.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Suggested-by: Thore Sommer <public@thson.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For device mapper targets to take advantage of IMA's measurement
capabilities, the status functions for the individual targets need to be
updated to handle the status_type_t case for value STATUSTYPE_IMA.
Update status functions for the following target types, to log their
respective attributes to be measured using IMA.
01. cache
02. crypt
03. integrity
04. linear
05. mirror
06. multipath
07. raid
08. snapshot
09. striped
10. verity
For rest of the targets, handle the STATUSTYPE_IMA case by setting the
measurement buffer to NULL.
For IMA to measure the data on a given system, the IMA policy on the
system needs to be updated to have the following line, and the system
needs to be restarted for the measurements to take effect.
/etc/ima/ima-policy
measure func=CRITICAL_DATA label=device-mapper template=ima-buf
The measurements will be reflected in the IMA logs, which are located at:
/sys/kernel/security/integrity/ima/ascii_runtime_measurements
/sys/kernel/security/integrity/ima/binary_runtime_measurements
These IMA logs can later be consumed by various attestation clients
running on the system, and send them to external services for attesting
the system.
The DM target data measured by IMA subsystem can alternatively
be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
DM_TABLE_STATUS_CMD.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 935fcc56ab ("dm mpath: only flush workqueue when needed")
changed flush_multipath_work() to avoid needless workqueue
flushing (of a multipath global workqueue). But that change didn't
realize the surrounding flush_multipath_work() code should also only
run if 'pg_init_in_progress' is set.
Fix this by only doing all of flush_multipath_work()'s PG init related
work if 'pg_init_in_progress' is set.
Otherwise multipath_wait_for_pg_init_completion() will run
unconditionally but the preceeding flush_workqueue(kmpath_handlerd)
may not. This could lead to deadlock (though only if kmpath_handlerd
never runs a corresponding work to decrement 'pg_init_in_progress').
It could also be, though highly unlikely, that the kmpath_handlerd
work that does PG init completes before 'pg_init_in_progress' is set,
and then an intervening DM table reload's multipath_postsuspend()
triggers flush_multipath_work().
Fixes: 935fcc56ab ("dm mpath: only flush workqueue when needed")
Cc: stable@vger.kernel.org
Reported-by: Ben Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
bio-based code so that it follows patterns established by
request-based code.
- Request-based DM core improvement to eliminate unnecessary call to
blk_mq_queue_stopped().
- Add "panic_on_corruption" error handling mode to DM verity target.
- DM bufio fix to to perform buffer cleanup from a workqueue rather
than wait for IO in reclaim context from shrinker.
- DM crypt improvement to optionally avoid async processing via
workqueues for reads and/or writes -- via "no_read_workqueue" and
"no_write_workqueue" features. This more direct IO processing
improves latency and throughput with faster storage. Avoiding
workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
a requirement for adding zoned block device support to DM crypt.
- Add zoned block device support to DM crypt. Makes use of
DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
(DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
encryption to complete. This allows write ordering to be preserved,
which is needed for zoned block devices.
- Fix DM ebs target's check for REQ_OP_FLUSH.
- Fix DM core's report zones support to not report more zones than
were requested.
- A few small compiler warning fixes.
- DM dust improvements to return output directly to the user rather
than require they scrape the system log for output.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl8tdOQTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWvDlB/sF8svagDqeqs27xTxCiUPykD29cMmS
OGPr0Mp/BntZOBpSaTPM9s5XucP3WJhPsxet5qeoyM3OViSFx+O55PqPjn8C65y0
eGMa4zknd9eO1933+ijmyQu6VNr4sf/6nusX4xSGqv00UR22dJ+3pHtfN9ANDXYX
AAYA0Ve6UuOwAbGUCnRGI/2780aYY0B8Ok+cF21CskqryF+RpmbZ6BsR07+Hk4cy
LX5EaHUqezW12cibLq2f0l7TLLJ86OscvqyU9lGVIxiV57e2i5c2S1HvhKZu+Wn3
6CUmlOhGI0viCKgM1ArekZ+zOw9ROIaAKKPzC5mspqx9yuuCqdY8k8xV
=X3tt
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM multipath locking fixes around m->flags tests and improvements to
bio-based code so that it follows patterns established by
request-based code.
- Request-based DM core improvement to eliminate unnecessary call to
blk_mq_queue_stopped().
- Add "panic_on_corruption" error handling mode to DM verity target.
- DM bufio fix to to perform buffer cleanup from a workqueue rather
than wait for IO in reclaim context from shrinker.
- DM crypt improvement to optionally avoid async processing via
workqueues for reads and/or writes -- via "no_read_workqueue" and
"no_write_workqueue" features. This more direct IO processing
improves latency and throughput with faster storage. Avoiding
workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a
requirement for adding zoned block device support to DM crypt.
- Add zoned block device support to DM crypt. Makes use of
DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
(DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
encryption to complete. This allows write ordering to be preserved,
which is needed for zoned block devices.
- Fix DM ebs target's check for REQ_OP_FLUSH.
- Fix DM core's report zones support to not report more zones than were
requested.
- A few small compiler warning fixes.
- DM dust improvements to return output directly to the user rather
than require they scrape the system log for output.
* tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: don't call report zones for more than the user requested
dm ebs: Fix incorrect checking for REQ_OP_FLUSH
dm init: Set file local variable static
dm ioctl: Fix compilation warning
dm raid: Remove empty if statement
dm verity: Fix compilation warning
dm crypt: Enable zoned block device support
dm crypt: add flags to optionally bypass kcryptd workqueues
dm bufio: do buffer cleanup from a workqueue
dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
dm dust: add interface to list all badblocks
dm dust: report some message results directly back to user
dm verity: add "panic_on_corruption" error handling mode
dm mpath: use double checked locking in fast path
dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
dm mpath: rework __map_bio()
dm mpath: factor out multipath_queue_bio
dm mpath: push locking down to must_push_back_rq()
dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
dm mpath: changes from initial m->flags locking audit
Fast-path code biased toward lazy acknowledgement of bit being set
(primarily only for initialization). Multipath code is very retry
oriented so even if state is missed it'll recover.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix multipath_end_io, multipath_end_io_bio and multipath_busy to take
m->lock while testing if MPATHF_QUEUE_IF_NO_PATH bit is set. These are
all slow-path cases when no paths are available so extra locking isn't a
performance hit. Correctness matters most.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When there are many DM multipath devices it really helps to have
additional context for which DM device a failed or reinstated path is
part of.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add more DMDEBUG that shows arguments passed and caller, and another
that shows state of related flags at end of queue_if_no_path().
Also add queue_if_no_path DMDEBUG to multipath_resume().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Do not allow saving disabled queue_if_no_path if already saved as
enabled; implies multiple suspends (which shouldn't ever happen). Log
if this unlikely scenario is ever triggered.
Also, only write MPATHF_SAVED_QUEUE_IF_NO_PATH during presuspend or if
"fail_if_no_path" message. MPATHF_SAVED_QUEUE_IF_NO_PATH is no longer
always modified, e.g.: even if queue_if_no_path()'s save_old_value
argument wasn't set. This just implies a bit tighter control over
the management of MPATHF_SAVED_QUEUE_IF_NO_PATH. Side-effect is
multipath_resume() doesn't reset MPATHF_QUEUE_IF_NO_PATH unless
MPATHF_SAVED_QUEUE_IF_NO_PATH was set (during presuspend); and at that
time the MPATHF_SAVED_QUEUE_IF_NO_PATH bit gets cleared. So
MPATHF_SAVED_QUEUE_IF_NO_PATH's use is much more narrow in scope.
Last, but not least, do _not_ disable queue_if_no_path during noflush
suspend. There is no need/benefit to saving off queue_if_no_path via
MPATHF_SAVED_QUEUE_IF_NO_PATH and clearing MPATHF_QUEUE_IF_NO_PATH for
noflush suspend -- by avoiding this needless queue_if_no_path flag
churn there is less potential for MPATHF_QUEUE_IF_NO_PATH to get lost.
Which avoids potential for IOs to be errored back up to userspace
during DM multipath's handling of path failures.
That said, this last change papers over a reported issue concerning
request-based dm-multipath's interaction with blk-mq, relative to
suspend and resume: multipath_endio is being called _before_
multipath_resume. This should never happen if DM suspend's
blk_mq_quiesce_queue() + dm_wait_for_completion() is genuinely waiting
for all inflight blk-mq requests to complete. Similarly:
drivers/md/dm.c:__dm_resume() clearly calls dm_table_resume_targets()
_before_ dm_start_queue()'s blk_mq_unquiesce_queue() is called. If
the queue isn't even restarted until after multipath_resume(); the BIG
question that still needs answering is: how can multipath_end_io beat
multipath_resume in a race!?
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove micro-optimization that infers device is between presuspend and
resume (was done purely to avoid call to dm_noflush_suspending, which
isn't expensive anyway).
Remove flags argument since they are no longer checked.
And remove must_push_back_bio() since it was simply a call to
__must_push_back().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Now that DMDEBUG uses pr_debug and DMDEBUG_LIMIT uses
pr_debug_ratelimited cleanup DM's 2 direct pr_debug callers to use
them to get the benefit of consistent DM_FMT formatting of debugging
messages.
While doing so, dm-mpath.c:dm_report_EIO() was switched over to using
DMDEBUG_LIMIT due to the potential for error handling floods in the IO
completion path.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The HST path selector needs this information to perform path
prediction. For request-based mpath, struct request's io_start_time_ns
is used, while for bio-based, use the start_time stored in dm_io.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
SCSI LUN passthrough code such as qemu's "scsi-block" device model
pass every IO to the host via SG_IO ioctls. Currently, dm-multipath
calls choose_pgpath() only in the block IO code path, not in the ioctl
code path (unless current_pgpath is NULL). This has the effect that no
path switching and thus no load balancing is done for SCSI-passthrough
IO, unless the active path fails.
Fix this by using the same logic in multipath_prepare_ioctl() as in
multipath_clone_and_map().
Note: The allegedly best path selection algorithm, service-time,
still wouldn't work perfectly, because the io size of the current
request is always set to 0. Changing that for the IO passthrough
case would require the ioctl cmd and arg to be passed to dm's
prepare_ioctl() method.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Changes made during the 5.6 cycle warrant bumping the version number
for DM core and the targets modified by this commit.
It should be noted that dm-thin, dm-crypt and dm-raid already had
their target version bumped during the 5.6 merge window.
Signed-off-by; Mike Snitzer <snitzer@redhat.com>
Add a configurable timeout mechanism to disable queue_if_no_path without
assistance from userspace multipathd. This reimplements multipathd's
no_path_retry mechanism in kernel space. This is motivated by the
desire to prevent processes from hanging indefinitely waiting for IO
in cases where multipathd might be unable to respond (after a failure
or for whatever reason).
Despite replicating userspace multipathd's policy configuration in
kernel space, it is important to prevent IOs from hanging forever,
waiting for userspace that may be incapable of behaving correctly.
Use of the provided "queue_if_no_path_timeout_secs" dm-multipath
module parameter is optional. This timeout mechanism is disabled by
default (by being set to 0).
Signed-off-by: Anatol Pomazau <anatol@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Removes the branching for edge-case where no SCSI device handler
exists. The __map_bio_fast() method was far too limited, by only
selecting a new pathgroup or path IFF there was a path failure, fix this
be eliminating it in favor of __map_bio(). __map_bio()'s extra SCSI
device handler specific MPATHF_PG_INIT_REQUIRED test is not in the fast
path anyway.
This change restores full path selector functionality for bio-based
configurations that don't haave a SCSI device handler. But it should be
noted that the path selectors do have an impact on performance for
certain networks that are extremely fast (and don't require frequent
switching).
Fixes: 8d47e65948 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
Cc: stable@vger.kernel.org
Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
Suggested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit b592211c33 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.
Fixes: b592211c33 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
After commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE. That means
type->start_io may be called multiple times, while type->end_io is only
called when IO complete.
In fact, even without commit 396eaf21ee, setup_clone() failure can
also cause tio requeue and associated missed call to type->end_io.
The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice. In addition,
queue-length path selector will also be affected.
To fix the problem, call type->end_io in ->release_clone_rq before tio
requeue. map_info is passed to ->release_clone_rq() for map_request()
error path that result in requeue.
Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The workqueues are shared by many multipath devices, only flush whole
workqueue when necessary. Otherwise, we just flush works as needed.
Signed-off-by: wuzhouhui <wuzhouhui14@mails.ucas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Now that request-based DM is only using blk-mq, there is no need to
differentiate between legacy "rq" and new "mq". We're back to a single
request-based DM -- and there was much rejoicing!
Signed-off-by: Mike Snitzer <snitzer@redhat.com>