Make it possible to use precise timestamps with nanosecond granularity
in dm statistics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the number_of_areas argument was zero the kernel would crash on
div-by-zero. Add better input validation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of
less memory utilization, improved performance and increased adaptability
in the face of changing workloads. SMQ also does not have any
cumbersome tuning knobs.
Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target. Doing so will cause all of the
mq policy's hints to be dropped. Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
In the future the "mq" policy will just silently make use of "smq" and
the mq code will be removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The metadata space map has a simplified 'bootstrap' mode that is
operational when extending the space maps. Whilst in this mode it's
possible for some refcount decrement operations to become queued (eg, as
a result of shadowing one of the bitmap indexes). These decrements were
not being applied when switching out of bootstrap mode.
The effect of this bug was the leaking of a 4k metadata block. This is
detected by the latest version of thin_check as a non fatal error.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
In dm_thin_find_block() the ->fail_io flag was checked outside the
metadata device's root_lock, causing dm_thin_find_block() to race with
the setting of this flag.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to
send the pool a message. Otherwise usespace is led to believe the
message failed due to invalid argument.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Previously REQ_DISCARD bios have been split into block sized chunks
before submission to the thin target. There are a couple of issues with
this:
- If the block size is small, a large discard request can
get broken up into a great many bios which is both slow and causes
a lot of memory pressure.
- The thin pool block size and the discard granularity for the
underlying data device need to be compatible if we want to passdown
the discard.
This patch relaxes the block size granularity for thin devices. It
makes use of the recent range locking added to the bio_prison to
quiesce a whole range of thin blocks before unmapping them. Once a
thin range has been unmapped the discard can then be passed down to
the data device for those sub ranges where the data blocks are no
longer used (ie. they weren't shared in the first place).
This patch also doesn't make any apologies about open-coding portions
of block core as a means to supporting async discard completions in the
near-term -- if/when late bio splitting lands it'll all get cleaned up.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Retrieve the next run of contiguously mapped blocks. Useful for working
out where to break up IO.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions. However if no IO is going through the
cache, then that bottom half work doesn't occur. With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.
Fix this by introducing a new 'can_block' parameter to the tick()
method. When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode. If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.
Once needs_check is set the cache device will not be allowed to
activate. Activation requires write access to metadata. Future work is
needed to add proper support for running the cache in read-only mode.
Once in fail-io mode the cache will report a status of "Fail".
Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When the cache is idle, writeback work was only being issued every
second. With this change outstanding writebacks are streamed
constantly. This offers a writeback performance improvement.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The stochastic-multi-queue (smq) policy addresses some of the problems
with the current multiqueue (mq) policy.
Memory usage
------------
The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.
SMQ uses 28bit indexes to implement it's data structures rather than
pointers. It avoids storing an explicit hit count for each block. It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).
All these mean smq uses ~25bytes per cache block. Still a lot of
memory, but a substantial improvement nontheless.
Level balancing
---------------
MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This means the bottom
levels generally have the most entries, and the top ones have very
few. Having unbalanced levels like this reduces the efficacy of the
multiqueue.
SMQ does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above. The over all
ordering being a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
Adaptability
------------
The MQ policy maintains a hit count for each cache block. For a
different block to get promoted to the cache it's hit count has to
exceed the lowest currently in the cache. This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.
SMQ doesn't maintain hit counts, so a lot of this problem just goes
away. In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.
Performance
-----------
In my tests SMQ shows substantially better performance than MQ. Once
this matures a bit more I'm sure it'll become the default policy.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).
The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).
This patch passes whole cells across to the worker. This saves
reaquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy. This reduces the overhead of using dm-cache.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rather than always releasing the prisoners in a cell, the client may
want to promote one of them to be the new holder. There is a race here
though between releasing an empty cell, and other threads adding new
inmates. So this function makes the decision with its lock held.
This function can have two outcomes:
i) An inmate is promoted to be the holder of the cell (return value of 0).
ii) The cell has no inmate for promotion and is released (return value of 1).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We only allow non critical writeback if the origin is idle. It is up
to the policy to decide what writeback work is critical.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.
FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.
This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.
Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
A crypto driver can process requests synchronously or asynchronously
and can use an internal driver queue to backlog requests.
Add some comments to clarify internal logic and completion return codes.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Currently if there is a leg failure, the bio will be put into the hold
list until userspace does a remove/replace on the leg. Doing so in a
cluster config (clvmd) is problematic because there may be a temporary
path failure that results in cluster raid1 remove/replace. Such
recovery takes a long time due to a full resync.
Update dm-raid1 to optionally ignore these failures so bios continue
being issued without interrupton. To enable this feature userspace
must pass "keep_log" when creating the dm-raid1 device.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Tested-by: Liuhua Wang <lwang@suse.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
On 32-bit:
drivers/md/dm-log-writes.c: In function ‘log_super’:
drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type
Add a ULL suffix to WRITE_LOG_MAGIC to fix this.
Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64
field.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add dm-raid access to the MD RAID0 personality to enable single zone
striping.
The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
to allow the raid0 personality to calculate the MD array
size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
(wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
so that userspace checks for 100% sync will succeed and allow
for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
that resulted from bio payloads that grew too large. This problem
did not occur with the other raid levels because it either did not
apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.
Also, backfill dm-raid target version 1.6.0 documentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The overwrite has only ever about optimizing away the need to zero a
block if the entire block was being overwritten. As such it is only
relevant when zeroing is enabled.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Introduce a single common method for cleaning up a DM device's
mapped_device. No functional change, just eliminates duplication of
delicate mapped_device cleanup code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized). Rename
dm_requeue_unmapped_original_request() to avoid potential confusion due
to function name containing "unmapped".
Also, remove dm_requeue_unmapped_request() since callers can easily call
the dm_requeue_original_request() directly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
Also refine __bind_mempools() to have more precise awareness of which
mempools each type of DM device uses -- avoids mempool churn when
reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_merge_bvec() was originally added in f6fccb ("dm: introduce
merge_bvec_fn"). In that commit a value in sectors is converted to
bytes using << 9, and then assigned to an int. This code made
assumptions about the value of BIO_MAX_SECTORS.
A later commit 148e51 ("dm: improve documentation and code clarity in
dm_merge_bvec") was meant to have no functional change but it removed
the use of BIO_MAX_SECTORS in favor of using queue_max_sectors(). At
this point the cast from sector_t to int resulted in a zero value. The
fallout being dm_merge_bvec() would only allow a single page to be added
to a bio.
This interim fix is minimal for the benefit of stable@ because the more
comprehensive cleanup of passing a sector_t to all DM targets' merge
function will impact quite a few DM targets.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
dm-multipath accepts 0 path mapping.
# echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev
Such a mapping can be used to release underlying devices while still
holding requests in its queue until working paths come back.
However, once the multipath device is created over blk-mq devices,
it rejects reloading of 0 path mapping:
# echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
| dmsetup create mpath1
# echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
device-mapper: reload ioctl on mpath1 failed: Invalid argument
Command failed
With following kernel message:
device-mapper: ioctl: can't change device type after initial table load.
DM tries to inherit the current table type using dm_table_set_type()
but it doesn't work as expected because of unnecessary check about
whether the target type is hybrid or not.
Hybrid type is for targets that work as either request-based or bio-based
and not required for blk-mq or non blk-mq checking.
Fixes: 65803c2059 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When stacking request-based dm device on non blk-mq device and
device-mapper target could not map the request (error target is used,
multipath target with all paths down, etc), the WARN_ON_ONCE() in
free_rq_clone() will trigger when it shouldn't.
The warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
pointer when requeueing unmapped request"). But free_rq_clone() with
clone->q == NULL is valid usage for the case where
dm_kill_unmapped_request() initiates request cleanup.
Fix this false warning by just removing the WARN_ON -- it only generated
false positives and was never useful in catching the intended case
(completing clone request not being mapped e.g. clone->q being NULL).
Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the
DM blk-mq device's .queue_rq. This cleans up the previous convoluted
handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK
(even though it wasn't) and then run blk_mq_requeue_request() followed
by blk_mq_kick_requeue_list().
Also, document that DM blk-mq ontop of old request_fn devices cannot
fail in clone_rq() since the clone request is preallocated as part of
the pdu.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When stacking request-based DM on blk_mq device, request cloning and
remapping are done in a single call to target's clone_and_map_rq().
The clone is allocated and valid only if clone_and_map_rq() returns
DM_MAPIO_REMAPPED.
The "IS_ERR(clone)" check in map_request() does not cover all the
!DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
are not ready or unavailable, clone_and_map_rq() may return
DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix this
by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
map_request().
Without this fix, DM core may call setup_clone() for a NULL clone
and oops like this:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
...
CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
...
Call Trace:
[<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
[<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
[<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
[<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
[<ffffffff81071fdd>] kthread+0xe6/0xee
[<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
[<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
[<ffffffff814c2d98>] ret_from_fork+0x58/0x90
[<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
Without kicking queue, requeued request may stay forever in
the queue if there are no other I/O activities to the device.
The original error had been in v2.6.39 with commit 7eaceaccab
("block: remove per-queue plugging"), which replaced conditional
plugging by periodic runqueue.
Commit 9d1deb83d4 in v4.1-rc1 removed the periodic runqueue
and the problem started to manifest.
Fixes: 9d1deb83d4 ("dm: don't schedule delayed run of the queue if nothing to do")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This is a set of five fixes: Two MAINTAINER email updates (urgent because the
non-avagotech emails will start bouncing) an lpfc big endian oops fix, a 256
byte sector hang fix (to eliminate 256 byte sectors) and a storvsc fix which
could cause test unit ready failures on bringup.
Signed-off-by: James Bottomley <JBottomley@Odin.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJVYPemAAoJEDeqqVYsXL0MK8UIAJJ8ViLPTfhweb8QgHsUjc4S
5x2jPUD69XT93G81NWKvFhYBnlAvktBZuy6uON7SkLWkbwgiu3pB1Iowqie+Z/h8
1C5VwC8oaT6GZPuakRRxnYmoAaCX7kbdMsibSvc5ouTgvM5FdailhJpkDKcFQlkn
t1GqmRQyH1P54O73af2d+z3AfNfivjoFFXz1YHcUOouiKBVyXzqk1c3CIwTQMelg
B547Rz2cqgdDtgQzZzPkfust/GOYs0ZQMU3rhjeNyIvw/5xa0/Ta2m/v16lckV2I
AvjK2Ht5XSr/Hc9m4yvr5vYCT5+mVrebYMghnUqpHjx1OGTBSU9lPT0exbCCJ/o=
=lXOs
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of five fixes: Two MAINTAINER email updates (urgent
because the non-avagotech emails will start bouncing) an lpfc big
endian oops fix, a 256 byte sector hang fix (to eliminate 256 byte
sectors) and a storvsc fix which could cause test unit ready failures
on bringup"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
MAINTAINERS: Revise lpfc maintainers for Avago Technologies ownership of Emulex
MAINTAINERS, be2iscsi: change email domain
sd: Disable support for 256 byte/sector disks
lpfc: Fix breakage on big endian kernels
storvsc: Set the SRB flags correctly when no data transfer is needed
Pull timer fix from Thomas Gleixner:
"One more fix from the timer departement:
- Handle division of negative nanosecond values proper on 32bit.
A recent cleanup wrecked the sign handling of the dividend and
dropped the check for negative divisors"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ktime: Fix ktime_divns to do signed division
Pull irqchip fix from Thomas Gleixner:
"A fix for a GIC-V3 irqchip regression which prevents some systems from
booting"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gicv3-its: ITS table size should not be smaller than PSZ