Add an option to dm statistics to collect and report a histogram of
IO latencies.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make it possible to use precise timestamps with nanosecond granularity
in dm statistics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of
less memory utilization, improved performance and increased adaptability
in the face of changing workloads. SMQ also does not have any
cumbersome tuning knobs.
Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target. Doing so will cause all of the
mq policy's hints to be dropped. Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
In the future the "mq" policy will just silently make use of "smq" and
the mq code will be removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode. If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.
Once needs_check is set the cache device will not be allowed to
activate. Activation requires write access to metadata. Future work is
needed to add proper support for running the cache in read-only mode.
Once in fail-io mode the cache will report a status of "Fail".
Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add dm-raid access to the MD RAID0 personality to enable single zone
striping.
The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
to allow the raid0 personality to calculate the MD array
size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
(wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
so that userspace checks for 100% sync will succeed and allow
for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
that resulted from bio payloads that grew too large. This problem
did not occur with the other raid levels because it either did not
apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.
Also, backfill dm-raid target version 1.6.0 documentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cryptsetup home page moved to GitLab.
Also remove link to abandonded Truecrypt page.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system. We capture
all write requests and associated data and log them to a separate device
for later replay. There is a userspace utility to do this replay. The
idea behind this is to give file system developers a tool to verify that
the file system is always consistent.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add device specific modes to dm-verity to specify how corrupted
blocks should be handled. The following modes are defined:
- DM_VERITY_MODE_EIO is the default behavior, where reading a
corrupted block results in -EIO.
- DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does
not block the read.
- DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted
block is discovered.
In addition, each mode sends a uevent to notify userspace of
corruption and to allow further recovery actions.
The driver defaults to previous behavior (DM_VERITY_MODE_EIO)
and other modes can be enabled with an additional parameter to
the verity table.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Make it possible to disable offloading writes by setting the optional
'submit_from_crypt_cpus' table argument.
There are some situations where offloading write bios from the
encryption threads to a single thread degrades performance
significantly.
The default is to offload write bios to the same thread because it
benefits CFQ to have writes submitted using the same IO context.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use unbound workqueue by default so that work is automatically balanced
between available CPUs. The original behavior of encrypting using the
same cpu that IO was submitted on can still be enabled by setting the
optional 'same_cpu_crypt' table argument.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Before, if the user wanted sequential IO to be promoted to the cache
they'd have to set sequential_threshold to some nebulous large value.
Now, the user may easily disable sequential IO detection (and sequential
IO's implicit bypass of the cache) by setting sequential_threshold to 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rather than maintaining a separate promote_threshold variable that we
periodically update we now use the hit count of the oldest clean
block. Also add a fudge factor to discourage demoting dirty blocks.
With some tests this has a sizeable difference, because the old code
was too eager to demote blocks. For example, device-mapper-test-suite's
git_extract_cache_quick test goes from taking 190 seconds, to 142
(linear on spindle takes 250).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add support for quickly loading a repetitive pattern into the
dm-switch target.
In the "set_regions_mappings" message, the user may now use "Rn,m" as
one of the arguments. "n" and "m" are hexadecimal numbers. The "Rn,m"
argument repeats the last "n" arguments in the following "m" slots.
For example:
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
Requested-by: Jay Wang <jwang@nimblestorage.com>
Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
holding IO forever") introduced a fixed 60 second timeout. Users may
want to either disable or modify this timeout.
Allow the out-of-data-space timeout to be configured using the
'no_space_timeout' dm-thin-pool module param. Setting it to 0 will
disable the timeout, resulting in IO being queued until more data space
is added to the thin-pool.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
dm-era is a target that behaves similar to the linear target. In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'. Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.
Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.
dm-era is primarily expected to be paired with the dm-cache target.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The Documentation for the thin provisioning target's held metadata root
feature was incorrect. It is now available and the value for the held
metadata root is in block units (not 512b sectors).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a thin metadata operation fails the current transaction will abort,
whereby causing potential for IO layers up the stack (e.g. filesystems)
to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
thin metadata's superblock which:
1) requires the user verify the thin metadata is consistent (e.g. use
thin_check, etc)
2) suggests the user verify the thin data is consistent (e.g. use fsck)
The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
to run thin_repair.
On metadata operation failure: abort current metadata transaction, set
pool in read-only mode, and now set the needs_check flag.
As part of this change, constraints are introduced or relaxed:
* don't allow a pool to transition to write mode if needs_check is set
* don't allow data or metadata space to be resized if needs_check is set
* if a thin pool's metadata space is exhausted: the kernel will now
force the user to take the pool offline for repair before the kernel
will allow the metadata space to be extended.
Also, update Documentation to include information about when the thin
provisioning target commits metadata, how it handles metadata failures
and running out of space.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
The cache's policy may have been established using the "default" alias,
which is currently the "mq" policy but the default policy may change in
the future. It is useful to know exactly which policy is being used.
Add a 'real' member to the dm_cache_policy_type structure and have the
"default" dm_cache_policy_type point to the real "mq"
dm_cache_policy_type. Update dm_cache_policy_get_name() to check if
real is set, if so report the name of the real policy (not the alias).
Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Improve cache_status to emit:
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
...
Adding the block sizes allows for easier calculation of the overall size
of both the metadata and cache devices. Adding <#total cache blocks>
provides useful context for how much of the cache is used.
Unfortunately these additions to the status will require updates to
users' scripts that monitor the cache status. But these changes help
provide more comprehensive information about the cache device and will
simplify tools that are being developed to manage dm-cache devices --
because they won't need to issue 3 operations to cobble together the
information that we can easily provide via a single status ioctl.
While updating the status documentation in cache.txt spaces were
tabify'd.
Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache.
This patch introduces three new tunables that allow you to tweak the
promotion threshold by adding a small value. These adjustments depend
on the io type:
read_promote_adjustment: READ io, default 4
write_promote_adjustment: WRITE io, default 8
discard_promote_adjustment: READ/WRITE io to a discarded block, default 1
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pool runs out of data or metadata space, the pool can either
queue or error the IO destined to the data device. The default is to
queue the IO until more space is added.
An admin may now configure the pool to error IO when no space is
available by setting the 'error_if_no_space' feature when loading the
thin-pool table.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The cache target's invalidate_cblocks message allows cache block
(cblock) ranges to be expressed with: <cblock start>-<cblock end>
The range's <cblock end> value is "one past the end", so the range
includes <cblock start> through <cblock end>-1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Document passthrough mode, cache shrinking, and cache invalidation.
Also, use strcasecmp() and hlist_unhashed().
Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cache block invalidation is removing an entry from the cache without
writing it back. Cache blocks can be invalidated via the
'invalidate_cblocks' message, which takes an arbitrary number of cblock
ranges:
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
E.g.
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
"Passthrough" is a dm-cache operating mode (like writethrough or
writeback) which is intended to be used when the cache contents are not
known to be coherent with the origin device. It behaves as follows:
* All reads are served from the origin device (all reads miss the cache)
* All writes are forwarded to the origin device; additionally, write
hits cause cache block invalidates
This mode decouples cache coherency checks from cache device creation,
largely to avoid having to perform coherency checks while booting. Boot
scripts can create cache devices in passthrough mode and put them into
service (mount cached filesystems, for example) without having to worry
about coherency. Coherency that exists is maintained, although the
cache will gradually cool as writes take place.
Later, applications can perform coherency checks, the nature of which
will depend on the type of the underlying storage. If coherency can be
verified, the cache device can be transitioned to writethrough or
writeback mode while still warm; otherwise, the cache contents can be
discarded prior to transitioning to the desired operating mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There are now two multiqueues for in cache blocks. A clean one and a
dirty one.
writeback_work comes from the dirty one. Demotions come from the clean
one.
There are two benefits:
- Performance improvement, since demoting a clean block is a noop.
- The cache cleans itself when io load is light.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-crypt can already activate TCRYPT (TrueCrypt compatible) containers
in LRW or XTS block encryption mode.
TCRYPT containers prior to version 4.1 use CBC mode with some additional
tweaks, this patch adds support for these containers.
This new mode is implemented using special IV generator named TCW
(TrueCrypt IV with whitening). TCW IV only supports containers that are
encrypted with one cipher (Tested with AES, Twofish, Serpent, CAST5 and
TripleDES).
While this mode is legacy and is known to be vulnerable to some
watermarking attacks (e.g. revealing of hidden disk existence) it can
still be useful to activate old containers without using 3rd party
software or for independent forensic analysis of such containers.
(Both the userspace and kernel code is an independent implementation
based on the format documentation and it completely avoids use of
original source code.)
The TCW IV generator uses two additional keys: Kw (whitening seed, size
is always 16 bytes - TCW_WHITENING_SIZE) and Kiv (IV seed, size is
always the IV size of the selected cipher). These keys are concatenated
at the end of the main encryption key provided in mapping table.
While whitening is completely independent from IV, it is implemented
inside IV generator for simplification.
The whitening value is always 16 bytes long and is calculated per sector
from provided Kw as initial seed, xored with sector number and mixed
with CRC32 algorithm. Resulting value is xored with ciphertext sector
content.
IV is calculated from the provided Kiv as initial IV seed and xored with
sector number.
Detailed calculation can be found in the Truecrypt documentation for
version < 4.1 and will also be described on dm-crypt site, see:
http://code.google.com/p/cryptsetup/wiki/DMCrypt
The experimental support for activation of these containers is already
present in git devel brach of cryptsetup.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Support the collection of I/O statistics on user-defined regions of
a DM device. If no regions are defined no statistics are collected so
there isn't any performance impact. Only bio-based DM devices are
currently supported.
Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.
The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
See Documentation/device-mapper/statistics.txt for more details.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Inform users that the data block size can't be any arbitrary number,
i.e. its value must be between 64KB and 1GB. Also, it should be a
multiple of 64KB.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Place upper bound on the cache's data block size (1GB).
Inform users that the data block size can't be any arbitrary number,
i.e. its value must be between 32KB and 1GB. Also, it should be a
multiple of 32KB.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
A cache's metadata device may not be shared by multiple cache devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
framework for storage arrays that dynamically reconfigure their
preferred paths for different device regions.
Fix a bug in the verity target that prevented its use with some
specific sizes of devices.
Improve some locking mechanisms in the device-mapper core and bufio.
Add Mike Snitzer as a device-mapper maintainer.
A few more clean-ups and fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJR3ehdAAoJEK2W1qbAHj1nseUP+gPgoX2YTBiKW/fQnbixb11c
0BExXiHtHgVnxQP4aJo8BJRFW9/DAN740UvKb2XjjbNChIQ47j6vOLCCzJ+97wW+
FCJ48pltsacgywvm5e3BbnwmcmpQXKk1Wd+1/9beWbcib9IzVB2B06Esv3HRtQZj
cQbIkeeTGbrSnsiAWSQh2xsNqjv1YObUohs43uG+Pa0WmdE1KebAYfkgEvi0b+E6
ehSsvAMqYRgkLvYdYTxRNJtC+H3pkucS6r42Q/tZj2YciU3tc0v6rsFW9Ey+l0E7
c5KaUAKk5e3HAhFvJ4ydlj7r1cu7G49rixIBJ60lX86QBwmZ8js5EEPliw0ZoWI+
av1P+9gLsxaQTH/Cw8jJW4xK7hYAZAvn//iNVBAATATd65nmQImHNWWMjr205Kw9
9XOeFUxAdnM7ITKXJkFf3vH2tFrRAKgXiR57im5ZuLMOFYWjR6EYE870+GCWSya8
Dhzj0Mb8IFHrelEbRWicNbD5IaAxvfQ6/sTvXBiV642jImkQIyIj+PBiIvsq8fTH
LKNL1l545R5aOHSU4TXnseq3TcIqElx0KsPTJuZq+q/2UfvMe9Lv9g+ld5CywfH1
1HkEB75yWPvEfOtIac9tzQSt3KnF01fC2QMYZE4rSiYs8KPgln9pxo+UulUaZzId
8Gch3/C5cBBCHjMJtv/b
=s5m4
-----END PGP SIGNATURE-----
Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper changes from Alasdair G Kergon:
"Add a device-mapper target called dm-switch to provide a multipath
framework for storage arrays that dynamically reconfigure their
preferred paths for different device regions.
Fix a bug in the verity target that prevented its use with some
specific sizes of devices.
Improve some locking mechanisms in the device-mapper core and bufio.
Add Mike Snitzer as a device-mapper maintainer.
A few more clean-ups and fixes"
* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm: add switch target
dm: update maintainers
dm: optimize reorder structure
dm: optimize use SRCU and RCU
dm bufio: submit writes outside lock
dm cache: fix arm link errors with inline
dm verity: use __ffs and __fls
dm flakey: correct ctr alloc failure mesg
dm verity: remove pointless comparison
dm: use __GFP_HIGHMEM in __vmalloc
dm verity: fix inability to use a few specific devices sizes
dm ioctl: set noio flag to avoid __vmalloc deadlock
dm mpath: fix ioctl deadlock when no paths
dm-switch is a new target that maps IO to underlying block devices
efficiently when there is a large number of fixed-sized address regions
but there is no simple pattern to allow for a compact mapping
representation such as dm-stripe.
Though we have developed this target for a specific storage device, Dell
EqualLogic, we have made an effort to keep it as general purpose as
possible in the hope that others may benefit.
Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.
Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
MD: Remember the last sync operation that was performed
This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired. etc.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
DM RAID: Add ability to restore transiently failed devices on resume
This patch adds code to the resume function to check over the devices
in the RAID array. If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array. This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
DM RAID: Add message/status support for changing sync action
This patch adds a message interface to dm-raid to allow the user to more
finely control the sync actions being performed by the MD driver. This
gives the user the ability to initiate "check" and "repair" (i.e. scrubbing).
Two additional fields have been appended to the status output to provide more
information about the type of sync action occurring and the results of those
actions, specifically: <sync_action> and <mismatch_cnt>. These new fields
will always be populated. This is essentially the device-mapper way of doing
what MD controls through the 'sync_action' sysfs file and shows through the
'mismatch_cnt' sysfs file.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
mostly little bugfixes.
Only "feature" is a new RAID10 layout which slightly
improves the number of sets of devices that can concurrently
fail, without data loss.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUTPm+znsnt1WYoG5AQLLsw/+PMqr8roC4twgxTWV1NRbU8NtOcRi9Rj9
uvBS63uYAaLdi/D3UBKFYczmNCu9knuXbcp9SgFDxH7LlthQsWN/GYnif06pPo3w
9Agu5M8c062TJEG1vrnX6FhPO6pNgrWFr3h+CKkTiD3179i9DoQpP8LXQToeyMtI
YRMQf/zCkxYtDvWAP0iwsEWtw8cf+q9I/uGPhQ1L+DnZapXYdbtnqWBRz9q6mrDt
orcGrP41aZHvnOHUaTbwmaorCKkf/Ys4SMaGenrSFpnpQMypt7VgNuwHC59LxvJT
5eiFG/26zIsv7Wk0jv/TvFP5qzUPo0/PFkd5ug0ArvbVRiXS2cMJDwQvMdO1toxD
i5Bb+P9DptadvoWhOTgIpxnG77yRH45wJvyJOk+ZfS1/IO87nCRa3d0yiNOU5e2/
o0VdXPZRr72sdKKTK6kQuYfwCPb+Z2Pz6Q8BJdk6GxlmTXyP6sKhIgwUX86534fE
LrOxfK8qV+GetVu3X02RoX2CyJJRQHXyXmbHuSzXuo/JiOYtDigAydwNZChvf+tf
OoMY9K8vgNbhnGsUG6la7XPvZ+6dZMjdnxp2HB99Ml5A3PWZd75i5T6IHHxIQFbD
C3z9PWTWP+hK4k15DEyjlELtsE9WduGTXG4kUcf328xJ/7lj4VIImVugdCz+1B6z
+HlI6BiLwzY=
=YdVD
-----END PGP SIGNATURE-----
Merge tag 'md-3.9' of git://neil.brown.name/md
Pull md updates from NeilBrown:
"Mostly little bugfixes.
Only "feature" is a new RAID10 layout which slightly improves the
number of sets of devices that can concurrently fail, without data
loss."
* tag 'md-3.9' of git://neil.brown.name/md:
md: expedite metadata update when switching read-auto -> active
md: remove CONFIG_MULTICORE_RAID456
md/raid1,raid10: fix deadlock with freeze_array()
md/raid0: improve error message when converting RAID4-with-spares to RAID0
md: raid0: fix error return from create_stripe_zones.
md: fix two bugs when attempting to resize RAID0 array.
DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
MD RAID10: Minor non-functional code changes
md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
md: protect against crash upon fsync on ro array
A simple cache policy that writes back all data to the origin.
This is used to decommission a dm cache by emptying it.
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
A cache policy that uses a multiqueue ordered by recent hit
count to select which blocks should be promoted and demoted.
This is meant to be a general purpose policy. It prioritises
reads over writes.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.
A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules. The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch. Other modules are
under development.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
Until now, dm-raid.c only supported the "near" algorthm of MD's RAID10
implementation. This patch adds support for the "far" and "offset"
algorithms, but only with the improved redundancy that is brought with
the introduction of the 'use_far_sets' bit, which shifts copied stripes
according to smaller sets vs the entire array. That is, the 17th bit
of the 'layout' variable that defines the RAID10 implementation will
always be set. (More information on how the 'layout' variable selects
the RAID10 algorithm can be found in the opening comments of
drivers/md/raid10.c.)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Before attempting to activate a RAID array, it is checked for sufficient
redundancy. That is, we make sure that there are not too many failed
devices - or devices specified for rebuild - to undermine our ability to
activate the array. The current code performs this check twice - once to
ensure there were not too many devices specified for rebuild by the user
('validate_rebuild_devices') and again after possibly experiencing a failure
to read the superblock ('analyse_superblocks'). Neither of these checks are
sufficient. The first check is done properly but with insufficient
information about the possible failure state of the devices to make a good
determination if the array can be activated. The second check is simply
done wrong in the case of RAID10 because it doesn't account for the
independence of the stripes (i.e. mirror sets). The solution is to use the
properly written check ('validate_rebuild_devices'), but perform the check
after the superblocks have been read and we know which devices have failed.
This gives us one check instead of two and performs it in a location where
it can be done right.
Only RAID10 was affected and it was affected in the following ways:
- the code did not properly catch the condition where a user specified
a device for rebuild that already had a failed device in the same mirror
set. (This condition would, however, be caught at a deeper level in MD.)
- the code triggers a false positive and denies activation when devices in
independent mirror sets have failed - counting the failures as though they
were all in the same set.
The most likely place this error was introduced (or this patch should have
been included) is in commit 4ec1e369 - first introduced in v3.7-rc1.
Consequently this fix should also go in v3.7.y, however there is a
small conflict on the .version in raid_target, so I'll submit a
separate patch to -stable.
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
DM RAID: Add code to validate replacement slots for RAID10 arrays
RAID10 can handle 'copies - 1' failures for each mirror group. This code
ensures the user has provided a valid array - one whose devices specified for
rebuild do not exceed the amount of redundancy available.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull md updates from NeilBrown.
* 'for-next' of git://neil.brown.name/md:
DM RAID: Add support for MD RAID10
md/RAID1: Add missing case for attempting to repair known bad blocks.
md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
md/raid1: don't abort a resync on the first badblock.
md: remove duplicated test on ->openers when calling do_md_stop()
raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
md/raid1: prevent merging too large request
md/raid1: read balance chooses idlest disk for SSD
md/raid1: make sequential read detection per disk based
MD RAID10: Export md_raid10_congested
MD: Move macros from raid1*.h to raid1*.c
MD RAID1: rename mirror_info structure
MD RAID10: rename mirror_info structure
MD RAID10: Fix compiler warning.
raid5: add a per-stripe lock
raid5: remove unnecessary bitmap write optimization
raid5: lockless access raid5 overrided bi_phys_segments
raid5: reduce chance release_stripe() taking device_lock