docs: convert docs to ReST and rename to *.rst
The conversion is actually: - add blank lines and indentation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This commit is contained in:
parent
8ea618899b
commit
f0ba43774c
|
@ -1,3 +1,4 @@
|
|||
=============================
|
||||
Guidance for writing policies
|
||||
=============================
|
||||
|
||||
|
@ -30,7 +31,7 @@ multiqueue (mq)
|
|||
|
||||
This policy is now an alias for smq (see below).
|
||||
|
||||
The following tunables are accepted, but have no effect:
|
||||
The following tunables are accepted, but have no effect::
|
||||
|
||||
'sequential_threshold <#nr_sequential_ios>'
|
||||
'random_threshold <#nr_random_ios>'
|
||||
|
@ -56,7 +57,9 @@ mq policy's hints to be dropped. Also, performance of the cache may
|
|||
degrade slightly until smq recalculates the origin device's hotspots
|
||||
that should be cached.
|
||||
|
||||
Memory usage:
|
||||
Memory usage
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The mq policy used a lot of memory; 88 bytes per cache block on a 64
|
||||
bit machine.
|
||||
|
||||
|
@ -69,7 +72,9 @@ cache block).
|
|||
All this means smq uses ~25bytes per cache block. Still a lot of
|
||||
memory, but a substantial improvement nontheless.
|
||||
|
||||
Level balancing:
|
||||
Level balancing
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
mq placed entries in different levels of the multiqueue structures
|
||||
based on their hit count (~ln(hit count)). This meant the bottom
|
||||
levels generally had the most entries, and the top ones had very
|
||||
|
@ -94,7 +99,9 @@ is used to decide which blocks to promote. If the hotspot queue is
|
|||
performing badly then it starts moving entries more quickly between
|
||||
levels. This lets it adapt to new IO patterns very quickly.
|
||||
|
||||
Performance:
|
||||
Performance
|
||||
^^^^^^^^^^^
|
||||
|
||||
Testing smq shows substantially better performance than mq.
|
||||
|
||||
cleaner
|
||||
|
@ -105,16 +112,19 @@ The cleaner writes back all dirty blocks in a cache to decommission it.
|
|||
Examples
|
||||
========
|
||||
|
||||
The syntax for a table is:
|
||||
The syntax for a table is::
|
||||
|
||||
cache <metadata dev> <cache dev> <origin dev> <block size>
|
||||
<#feature_args> [<feature arg>]*
|
||||
<policy> <#policy_args> [<policy arg>]*
|
||||
|
||||
The syntax to send a message using the dmsetup command is:
|
||||
The syntax to send a message using the dmsetup command is::
|
||||
|
||||
dmsetup message <mapped device> 0 sequential_threshold 1024
|
||||
dmsetup message <mapped device> 0 random_threshold 8
|
||||
|
||||
Using dmsetup:
|
||||
Using dmsetup::
|
||||
|
||||
dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
|
||||
/dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
|
||||
creates a 128GB large mapped device named 'blah' with the
|
|
@ -1,3 +1,7 @@
|
|||
=====
|
||||
Cache
|
||||
=====
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
|
@ -24,10 +28,13 @@ scenarios (eg. a vm image server).
|
|||
Glossary
|
||||
========
|
||||
|
||||
Migration - Movement of the primary copy of a logical block from one
|
||||
Migration
|
||||
Movement of the primary copy of a logical block from one
|
||||
device to the other.
|
||||
Promotion - Migration from slow device to fast device.
|
||||
Demotion - Migration from fast device to slow device.
|
||||
Promotion
|
||||
Migration from slow device to fast device.
|
||||
Demotion
|
||||
Migration from fast device to slow device.
|
||||
|
||||
The origin device always contains a copy of the logical block, which
|
||||
may be out of date or kept in sync with the copy on the cache device
|
||||
|
@ -169,45 +176,53 @@ Target interface
|
|||
Constructor
|
||||
-----------
|
||||
|
||||
cache <metadata dev> <cache dev> <origin dev> <block size>
|
||||
<#feature args> [<feature arg>]*
|
||||
<policy> <#policy args> [policy args]*
|
||||
::
|
||||
|
||||
metadata dev : fast device holding the persistent metadata
|
||||
cache dev : fast device holding cached data blocks
|
||||
origin dev : slow device holding original data blocks
|
||||
block size : cache unit size in sectors
|
||||
cache <metadata dev> <cache dev> <origin dev> <block size>
|
||||
<#feature args> [<feature arg>]*
|
||||
<policy> <#policy args> [policy args]*
|
||||
|
||||
#feature args : number of feature arguments passed
|
||||
feature args : writethrough or passthrough (The default is writeback.)
|
||||
================ =======================================================
|
||||
metadata dev fast device holding the persistent metadata
|
||||
cache dev fast device holding cached data blocks
|
||||
origin dev slow device holding original data blocks
|
||||
block size cache unit size in sectors
|
||||
|
||||
policy : the replacement policy to use
|
||||
#policy args : an even number of arguments corresponding to
|
||||
key/value pairs passed to the policy
|
||||
policy args : key/value pairs passed to the policy
|
||||
E.g. 'sequential_threshold 1024'
|
||||
See cache-policies.txt for details.
|
||||
#feature args number of feature arguments passed
|
||||
feature args writethrough or passthrough (The default is writeback.)
|
||||
|
||||
policy the replacement policy to use
|
||||
#policy args an even number of arguments corresponding to
|
||||
key/value pairs passed to the policy
|
||||
policy args key/value pairs passed to the policy
|
||||
E.g. 'sequential_threshold 1024'
|
||||
See cache-policies.txt for details.
|
||||
================ =======================================================
|
||||
|
||||
Optional feature arguments are:
|
||||
writethrough : write through caching that prohibits cache block
|
||||
content from being different from origin block content.
|
||||
Without this argument, the default behaviour is to write
|
||||
back cache block contents later for performance reasons,
|
||||
so they may differ from the corresponding origin blocks.
|
||||
|
||||
passthrough : a degraded mode useful for various cache coherency
|
||||
situations (e.g., rolling back snapshots of
|
||||
underlying storage). Reads and writes always go to
|
||||
the origin. If a write goes to a cached origin
|
||||
block, then the cache block is invalidated.
|
||||
To enable passthrough mode the cache must be clean.
|
||||
|
||||
metadata2 : use version 2 of the metadata. This stores the dirty bits
|
||||
in a separate btree, which improves speed of shutting
|
||||
down the cache.
|
||||
==================== ========================================================
|
||||
writethrough write through caching that prohibits cache block
|
||||
content from being different from origin block content.
|
||||
Without this argument, the default behaviour is to write
|
||||
back cache block contents later for performance reasons,
|
||||
so they may differ from the corresponding origin blocks.
|
||||
|
||||
no_discard_passdown : disable passing down discards from the cache
|
||||
to the origin's data device.
|
||||
passthrough a degraded mode useful for various cache coherency
|
||||
situations (e.g., rolling back snapshots of
|
||||
underlying storage). Reads and writes always go to
|
||||
the origin. If a write goes to a cached origin
|
||||
block, then the cache block is invalidated.
|
||||
To enable passthrough mode the cache must be clean.
|
||||
|
||||
metadata2 use version 2 of the metadata. This stores the dirty
|
||||
bits in a separate btree, which improves speed of
|
||||
shutting down the cache.
|
||||
|
||||
no_discard_passdown disable passing down discards from the cache
|
||||
to the origin's data device.
|
||||
==================== ========================================================
|
||||
|
||||
A policy called 'default' is always registered. This is an alias for
|
||||
the policy we currently think is giving best all round performance.
|
||||
|
@ -218,54 +233,61 @@ the characteristics of a specific policy, always request it by name.
|
|||
Status
|
||||
------
|
||||
|
||||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<cache block size> <#used cache blocks>/<#total cache blocks>
|
||||
<#read hits> <#read misses> <#write hits> <#write misses>
|
||||
<#demotions> <#promotions> <#dirty> <#features> <features>*
|
||||
<#core args> <core args>* <policy name> <#policy args> <policy args>*
|
||||
<cache metadata mode>
|
||||
::
|
||||
|
||||
metadata block size : Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks : Number of metadata blocks used
|
||||
#total metadata blocks : Total number of metadata blocks
|
||||
cache block size : Configurable block size for the cache device
|
||||
in sectors
|
||||
#used cache blocks : Number of blocks resident in the cache
|
||||
#total cache blocks : Total number of cache blocks
|
||||
#read hits : Number of times a READ bio has been mapped
|
||||
to the cache
|
||||
#read misses : Number of times a READ bio has been mapped
|
||||
to the origin
|
||||
#write hits : Number of times a WRITE bio has been mapped
|
||||
to the cache
|
||||
#write misses : Number of times a WRITE bio has been
|
||||
mapped to the origin
|
||||
#demotions : Number of times a block has been removed
|
||||
from the cache
|
||||
#promotions : Number of times a block has been moved to
|
||||
the cache
|
||||
#dirty : Number of blocks in the cache that differ
|
||||
from the origin
|
||||
#feature args : Number of feature args to follow
|
||||
feature args : 'writethrough' (optional)
|
||||
#core args : Number of core arguments (must be even)
|
||||
core args : Key/value pairs for tuning the core
|
||||
e.g. migration_threshold
|
||||
policy name : Name of the policy
|
||||
#policy args : Number of policy arguments to follow (must be even)
|
||||
policy args : Key/value pairs e.g. sequential_threshold
|
||||
cache metadata mode : ro if read-only, rw if read-write
|
||||
In serious cases where even a read-only mode is deemed unsafe
|
||||
no further I/O will be permitted and the status will just
|
||||
contain the string 'Fail'. The userspace recovery tools
|
||||
should then be used.
|
||||
needs_check : 'needs_check' if set, '-' if not set
|
||||
A metadata operation has failed, resulting in the needs_check
|
||||
flag being set in the metadata's superblock. The metadata
|
||||
device must be deactivated and checked/repaired before the
|
||||
cache can be made fully operational again. '-' indicates
|
||||
needs_check is not set.
|
||||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<cache block size> <#used cache blocks>/<#total cache blocks>
|
||||
<#read hits> <#read misses> <#write hits> <#write misses>
|
||||
<#demotions> <#promotions> <#dirty> <#features> <features>*
|
||||
<#core args> <core args>* <policy name> <#policy args> <policy args>*
|
||||
<cache metadata mode>
|
||||
|
||||
|
||||
========================= =====================================================
|
||||
metadata block size Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks Number of metadata blocks used
|
||||
#total metadata blocks Total number of metadata blocks
|
||||
cache block size Configurable block size for the cache device
|
||||
in sectors
|
||||
#used cache blocks Number of blocks resident in the cache
|
||||
#total cache blocks Total number of cache blocks
|
||||
#read hits Number of times a READ bio has been mapped
|
||||
to the cache
|
||||
#read misses Number of times a READ bio has been mapped
|
||||
to the origin
|
||||
#write hits Number of times a WRITE bio has been mapped
|
||||
to the cache
|
||||
#write misses Number of times a WRITE bio has been
|
||||
mapped to the origin
|
||||
#demotions Number of times a block has been removed
|
||||
from the cache
|
||||
#promotions Number of times a block has been moved to
|
||||
the cache
|
||||
#dirty Number of blocks in the cache that differ
|
||||
from the origin
|
||||
#feature args Number of feature args to follow
|
||||
feature args 'writethrough' (optional)
|
||||
#core args Number of core arguments (must be even)
|
||||
core args Key/value pairs for tuning the core
|
||||
e.g. migration_threshold
|
||||
policy name Name of the policy
|
||||
#policy args Number of policy arguments to follow (must be even)
|
||||
policy args Key/value pairs e.g. sequential_threshold
|
||||
cache metadata mode ro if read-only, rw if read-write
|
||||
|
||||
In serious cases where even a read-only mode is
|
||||
deemed unsafe no further I/O will be permitted and
|
||||
the status will just contain the string 'Fail'.
|
||||
The userspace recovery tools should then be used.
|
||||
needs_check 'needs_check' if set, '-' if not set
|
||||
A metadata operation has failed, resulting in the
|
||||
needs_check flag being set in the metadata's
|
||||
superblock. The metadata device must be
|
||||
deactivated and checked/repaired before the
|
||||
cache can be made fully operational again.
|
||||
'-' indicates needs_check is not set.
|
||||
========================= =====================================================
|
||||
|
||||
Messages
|
||||
--------
|
||||
|
@ -274,11 +296,12 @@ Policies will have different tunables, specific to each one, so we
|
|||
need a generic way of getting and setting these. Device-mapper
|
||||
messages are used. (A sysfs interface would also be possible.)
|
||||
|
||||
The message format is:
|
||||
The message format is::
|
||||
|
||||
<key> <value>
|
||||
|
||||
E.g.
|
||||
E.g.::
|
||||
|
||||
dmsetup message my_cache 0 sequential_threshold 1024
|
||||
|
||||
|
||||
|
@ -290,11 +313,12 @@ of values from 5 to 9. Each cblock must be expressed as a decimal
|
|||
value, in the future a variant message that takes cblock ranges
|
||||
expressed in hexadecimal may be needed to better support efficient
|
||||
invalidation of larger caches. The cache must be in passthrough mode
|
||||
when invalidate_cblocks is used.
|
||||
when invalidate_cblocks is used::
|
||||
|
||||
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
|
||||
|
||||
E.g.
|
||||
E.g.::
|
||||
|
||||
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
|
||||
|
||||
Examples
|
||||
|
@ -304,8 +328,10 @@ The test suite can be found here:
|
|||
|
||||
https://github.com/jthornber/device-mapper-test-suite
|
||||
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
|
||||
mq 4 sequential_threshold 1024 random_threshold 8'
|
||||
::
|
||||
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
|
||||
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
|
||||
/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
|
||||
mq 4 sequential_threshold 1024 random_threshold 8'
|
|
@ -1,10 +1,12 @@
|
|||
========
|
||||
dm-delay
|
||||
========
|
||||
|
||||
Device-Mapper's "delay" target delays reads and/or writes
|
||||
and maps them to different devices.
|
||||
|
||||
Parameters:
|
||||
Parameters::
|
||||
|
||||
<device> <offset> <delay> [<write_device> <write_offset> <write_delay>
|
||||
[<flush_device> <flush_offset> <flush_delay>]]
|
||||
|
||||
|
@ -14,15 +16,16 @@ Delays are specified in milliseconds.
|
|||
|
||||
Example scripts
|
||||
===============
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create device delaying rw operation for 500ms
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
|
||||
]]
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create device delaying only write operation for 500ms and
|
||||
# splitting reads and writes to different devices $1 $2
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
|
||||
]]
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create device delaying rw operation for 500ms
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create device delaying only write operation for 500ms and
|
||||
# splitting reads and writes to different devices $1 $2
|
||||
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
|
|
@ -1,5 +1,6 @@
|
|||
========
|
||||
dm-crypt
|
||||
=========
|
||||
========
|
||||
|
||||
Device-Mapper's "crypt" target provides transparent encryption of block devices
|
||||
using the kernel crypto API.
|
||||
|
@ -7,15 +8,20 @@ using the kernel crypto API.
|
|||
For a more detailed description of supported parameters see:
|
||||
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
|
||||
|
||||
Parameters: <cipher> <key> <iv_offset> <device path> \
|
||||
Parameters::
|
||||
|
||||
<cipher> <key> <iv_offset> <device path> \
|
||||
<offset> [<#opt_params> <opt_params>]
|
||||
|
||||
<cipher>
|
||||
Encryption cipher, encryption mode and Initial Vector (IV) generator.
|
||||
|
||||
The cipher specifications format is:
|
||||
The cipher specifications format is::
|
||||
|
||||
cipher[:keycount]-chainmode-ivmode[:ivopts]
|
||||
Examples:
|
||||
|
||||
Examples::
|
||||
|
||||
aes-cbc-essiv:sha256
|
||||
aes-xts-plain64
|
||||
serpent-xts-plain64
|
||||
|
@ -25,12 +31,17 @@ Parameters: <cipher> <key> <iv_offset> <device path> \
|
|||
as for the first format type.
|
||||
This format is mainly used for specification of authenticated modes.
|
||||
|
||||
The crypto API cipher specifications format is:
|
||||
The crypto API cipher specifications format is::
|
||||
|
||||
capi:cipher_api_spec-ivmode[:ivopts]
|
||||
Examples:
|
||||
|
||||
Examples::
|
||||
|
||||
capi:cbc(aes)-essiv:sha256
|
||||
capi:xts(aes)-plain64
|
||||
Examples of authenticated modes:
|
||||
|
||||
Examples of authenticated modes::
|
||||
|
||||
capi:gcm(aes)-random
|
||||
capi:authenc(hmac(sha256),xts(aes))-random
|
||||
capi:rfc7539(chacha20,poly1305)-random
|
||||
|
@ -142,21 +153,21 @@ LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
|
|||
encryption with dm-crypt using the 'cryptsetup' utility, see
|
||||
https://gitlab.com/cryptsetup/cryptsetup
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup
|
||||
dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
|
||||
]]
|
||||
::
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup when encryption key is stored in keyring service
|
||||
dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
|
||||
]]
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup
|
||||
dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create a crypt device using cryptsetup and LUKS header with default cipher
|
||||
cryptsetup luksFormat $1
|
||||
cryptsetup luksOpen $1 crypt1
|
||||
]]
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create a crypt device using dmsetup when encryption key is stored in keyring service
|
||||
dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create a crypt device using cryptsetup and LUKS header with default cipher
|
||||
cryptsetup luksFormat $1
|
||||
cryptsetup luksOpen $1 crypt1
|
|
@ -1,3 +1,4 @@
|
|||
=========
|
||||
dm-flakey
|
||||
=========
|
||||
|
||||
|
@ -15,17 +16,26 @@ underlying devices.
|
|||
|
||||
Table parameters
|
||||
----------------
|
||||
|
||||
::
|
||||
|
||||
<dev path> <offset> <up interval> <down interval> \
|
||||
[<num_features> [<feature arguments>]]
|
||||
|
||||
Mandatory parameters:
|
||||
<dev path>: Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>: Starting sector within the device.
|
||||
<up interval>: Number of seconds device is available.
|
||||
<down interval>: Number of seconds device returns errors.
|
||||
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
<up interval>:
|
||||
Number of seconds device is available.
|
||||
<down interval>:
|
||||
Number of seconds device returns errors.
|
||||
|
||||
Optional feature parameters:
|
||||
|
||||
If no feature parameters are present, during the periods of
|
||||
unreliability, all I/O returns errors.
|
||||
|
||||
|
@ -41,17 +51,24 @@ Optional feature parameters:
|
|||
During <down interval>, replace <Nth_byte> of the data of
|
||||
each matching bio with <value>.
|
||||
|
||||
<Nth_byte>: The offset of the byte to replace.
|
||||
Counting starts at 1, to replace the first byte.
|
||||
<direction>: Either 'r' to corrupt reads or 'w' to corrupt writes.
|
||||
'w' is incompatible with drop_writes.
|
||||
<value>: The value (from 0-255) to write.
|
||||
<flags>: Perform the replacement only if bio->bi_opf has all the
|
||||
selected flags set.
|
||||
<Nth_byte>:
|
||||
The offset of the byte to replace.
|
||||
Counting starts at 1, to replace the first byte.
|
||||
<direction>:
|
||||
Either 'r' to corrupt reads or 'w' to corrupt writes.
|
||||
'w' is incompatible with drop_writes.
|
||||
<value>:
|
||||
The value (from 0-255) to write.
|
||||
<flags>:
|
||||
Perform the replacement only if bio->bi_opf has all the
|
||||
selected flags set.
|
||||
|
||||
Examples:
|
||||
|
||||
Replaces the 32nd byte of READ bios with the value 1::
|
||||
|
||||
corrupt_bio_byte 32 r 1 0
|
||||
- replaces the 32nd byte of READ bios with the value 1
|
||||
|
||||
Replaces the 224th byte of REQ_META (=32) bios with the value 0::
|
||||
|
||||
corrupt_bio_byte 224 w 0 32
|
||||
- replaces the 224th byte of REQ_META (=32) bios with the value 0
|
|
@ -1,5 +1,6 @@
|
|||
================================
|
||||
Early creation of mapped devices
|
||||
====================================
|
||||
================================
|
||||
|
||||
It is possible to configure a device-mapper device to act as the root device for
|
||||
your system in two ways.
|
||||
|
@ -12,15 +13,17 @@ The second is to create one or more device-mappers using the module parameter
|
|||
|
||||
The format is specified as a string of data separated by commas and optionally
|
||||
semi-colons, where:
|
||||
|
||||
- a comma is used to separate fields like name, uuid, flags and table
|
||||
(specifies one device)
|
||||
- a semi-colon is used to separate devices.
|
||||
|
||||
So the format will look like this:
|
||||
So the format will look like this::
|
||||
|
||||
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
|
||||
|
||||
Where,
|
||||
Where::
|
||||
|
||||
<name> ::= The device name.
|
||||
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
|
||||
<minor> ::= The device minor number | ""
|
||||
|
@ -29,7 +32,7 @@ Where,
|
|||
<target_type> ::= "verity" | "linear" | ... (see list below)
|
||||
|
||||
The dm line should be equivalent to the one used by the dmsetup tool with the
|
||||
--concise argument.
|
||||
`--concise` argument.
|
||||
|
||||
Target types
|
||||
============
|
||||
|
@ -38,32 +41,34 @@ Not all target types are available as there are serious risks in allowing
|
|||
activation of certain DM targets without first using userspace tools to check
|
||||
the validity of associated metadata.
|
||||
|
||||
"cache": constrained, userspace should verify cache device
|
||||
"crypt": allowed
|
||||
"delay": allowed
|
||||
"era": constrained, userspace should verify metadata device
|
||||
"flakey": constrained, meant for test
|
||||
"linear": allowed
|
||||
"log-writes": constrained, userspace should verify metadata device
|
||||
"mirror": constrained, userspace should verify main/mirror device
|
||||
"raid": constrained, userspace should verify metadata device
|
||||
"snapshot": constrained, userspace should verify src/dst device
|
||||
"snapshot-origin": allowed
|
||||
"snapshot-merge": constrained, userspace should verify src/dst device
|
||||
"striped": allowed
|
||||
"switch": constrained, userspace should verify dev path
|
||||
"thin": constrained, requires dm target message from userspace
|
||||
"thin-pool": constrained, requires dm target message from userspace
|
||||
"verity": allowed
|
||||
"writecache": constrained, userspace should verify cache device
|
||||
"zero": constrained, not meant for rootfs
|
||||
======================= =======================================================
|
||||
`cache` constrained, userspace should verify cache device
|
||||
`crypt` allowed
|
||||
`delay` allowed
|
||||
`era` constrained, userspace should verify metadata device
|
||||
`flakey` constrained, meant for test
|
||||
`linear` allowed
|
||||
`log-writes` constrained, userspace should verify metadata device
|
||||
`mirror` constrained, userspace should verify main/mirror device
|
||||
`raid` constrained, userspace should verify metadata device
|
||||
`snapshot` constrained, userspace should verify src/dst device
|
||||
`snapshot-origin` allowed
|
||||
`snapshot-merge` constrained, userspace should verify src/dst device
|
||||
`striped` allowed
|
||||
`switch` constrained, userspace should verify dev path
|
||||
`thin` constrained, requires dm target message from userspace
|
||||
`thin-pool` constrained, requires dm target message from userspace
|
||||
`verity` allowed
|
||||
`writecache` constrained, userspace should verify cache device
|
||||
`zero` constrained, not meant for rootfs
|
||||
======================= =======================================================
|
||||
|
||||
If the target is not listed above, it is constrained by default (not tested).
|
||||
|
||||
Examples
|
||||
========
|
||||
An example of booting to a linear array made up of user-mode linux block
|
||||
devices:
|
||||
devices::
|
||||
|
||||
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
|
||||
|
||||
|
@ -71,8 +76,8 @@ This will boot to a rw dm-linear target of 8192 sectors split across two block
|
|||
devices identified by their major:minor numbers. After boot, udev will rename
|
||||
this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
|
||||
|
||||
An example of multiple device-mappers, with the dm-mod.create="..." contents is shown here
|
||||
split on multiple lines for readability:
|
||||
An example of multiple device-mappers, with the dm-mod.create="..." contents
|
||||
is shown here split on multiple lines for readability::
|
||||
|
||||
dm-linear,,1,rw,
|
||||
0 32768 linear 8:1 0,
|
||||
|
@ -84,30 +89,36 @@ split on multiple lines for readability:
|
|||
|
||||
Other examples (per target):
|
||||
|
||||
"crypt":
|
||||
"crypt"::
|
||||
|
||||
dm-crypt,,8,ro,
|
||||
0 1048576 crypt aes-xts-plain64
|
||||
babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
|
||||
/dev/sda 0 1 allow_discards
|
||||
|
||||
"delay":
|
||||
"delay"::
|
||||
|
||||
dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
|
||||
|
||||
"linear":
|
||||
"linear"::
|
||||
|
||||
dm-linear,,,rw,
|
||||
0 32768 linear /dev/sda1 0,
|
||||
32768 1024000 linear /dev/sda2 0,
|
||||
1056768 204800 linear /dev/sda3 0,
|
||||
1261568 512000 linear /dev/sda4 0
|
||||
|
||||
"snapshot-origin":
|
||||
"snapshot-origin"::
|
||||
|
||||
dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
|
||||
|
||||
"striped":
|
||||
"striped"::
|
||||
|
||||
dm-striped,,4,ro,0 1638400 striped 4 4096
|
||||
/dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
|
||||
|
||||
"verity":
|
||||
"verity"::
|
||||
|
||||
dm-verity,,4,ro,
|
||||
0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
|
||||
fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
|
|
@ -1,3 +1,7 @@
|
|||
============
|
||||
dm-integrity
|
||||
============
|
||||
|
||||
The dm-integrity target emulates a block device that has additional
|
||||
per-sector tags that can be used for storing integrity information.
|
||||
|
||||
|
@ -35,15 +39,16 @@ zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
|
|||
target can't be loaded.
|
||||
|
||||
To use the target for the first time:
|
||||
|
||||
1. overwrite the superblock with zeroes
|
||||
2. load the dm-integrity target with one-sector size, the kernel driver
|
||||
will format the device
|
||||
will format the device
|
||||
3. unload the dm-integrity target
|
||||
4. read the "provided_data_sectors" value from the superblock
|
||||
5. load the dm-integrity target with the the target size
|
||||
"provided_data_sectors"
|
||||
"provided_data_sectors"
|
||||
6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
|
||||
with the size "provided_data_sectors"
|
||||
with the size "provided_data_sectors"
|
||||
|
||||
|
||||
Target arguments:
|
||||
|
@ -51,17 +56,20 @@ Target arguments:
|
|||
1. the underlying block device
|
||||
|
||||
2. the number of reserved sector at the beginning of the device - the
|
||||
dm-integrity won't read of write these sectors
|
||||
dm-integrity won't read of write these sectors
|
||||
|
||||
3. the size of the integrity tag (if "-" is used, the size is taken from
|
||||
the internal-hash algorithm)
|
||||
the internal-hash algorithm)
|
||||
|
||||
4. mode:
|
||||
D - direct writes (without journal) - in this mode, journaling is
|
||||
|
||||
D - direct writes (without journal)
|
||||
in this mode, journaling is
|
||||
not used and data sectors and integrity tags are written
|
||||
separately. In case of crash, it is possible that the data
|
||||
and integrity tag doesn't match.
|
||||
J - journaled writes - data and integrity tags are written to the
|
||||
J - journaled writes
|
||||
data and integrity tags are written to the
|
||||
journal and atomicity is guaranteed. In case of crash,
|
||||
either both data and tag or none of them are written. The
|
||||
journaled mode degrades write throughput twice because the
|
||||
|
@ -178,9 +186,12 @@ and the reloaded target would be non-functional.
|
|||
|
||||
|
||||
The layout of the formatted block device:
|
||||
* reserved sectors (they are not used by this target, they can be used for
|
||||
storing LUKS metadata or for other purpose), the size of the reserved
|
||||
area is specified in the target arguments
|
||||
|
||||
* reserved sectors
|
||||
(they are not used by this target, they can be used for
|
||||
storing LUKS metadata or for other purpose), the size of the reserved
|
||||
area is specified in the target arguments
|
||||
|
||||
* superblock (4kiB)
|
||||
* magic string - identifies that the device was formatted
|
||||
* version
|
||||
|
@ -192,40 +203,55 @@ The layout of the formatted block device:
|
|||
metadata and padding). The user of this target should not send
|
||||
bios that access data beyond the "provided data sectors" limit.
|
||||
* flags
|
||||
SB_FLAG_HAVE_JOURNAL_MAC - a flag is set if journal_mac is used
|
||||
SB_FLAG_RECALCULATING - recalculating is in progress
|
||||
SB_FLAG_DIRTY_BITMAP - journal area contains the bitmap of dirty
|
||||
blocks
|
||||
SB_FLAG_HAVE_JOURNAL_MAC
|
||||
- a flag is set if journal_mac is used
|
||||
SB_FLAG_RECALCULATING
|
||||
- recalculating is in progress
|
||||
SB_FLAG_DIRTY_BITMAP
|
||||
- journal area contains the bitmap of dirty
|
||||
blocks
|
||||
* log2(sectors per block)
|
||||
* a position where recalculating finished
|
||||
* journal
|
||||
The journal is divided into sections, each section contains:
|
||||
|
||||
* metadata area (4kiB), it contains journal entries
|
||||
every journal entry contains:
|
||||
|
||||
- every journal entry contains:
|
||||
|
||||
* logical sector (specifies where the data and tag should
|
||||
be written)
|
||||
* last 8 bytes of data
|
||||
* integrity tag (the size is specified in the superblock)
|
||||
every metadata sector ends with
|
||||
|
||||
- every metadata sector ends with
|
||||
|
||||
* mac (8-bytes), all the macs in 8 metadata sectors form a
|
||||
64-byte value. It is used to store hmac of sector
|
||||
numbers in the journal section, to protect against a
|
||||
possibility that the attacker tampers with sector
|
||||
numbers in the journal.
|
||||
* commit id
|
||||
|
||||
* data area (the size is variable; it depends on how many journal
|
||||
entries fit into the metadata area)
|
||||
every sector in the data area contains:
|
||||
|
||||
- every sector in the data area contains:
|
||||
|
||||
* data (504 bytes of data, the last 8 bytes are stored in
|
||||
the journal entry)
|
||||
* commit id
|
||||
|
||||
To test if the whole journal section was written correctly, every
|
||||
512-byte sector of the journal ends with 8-byte commit id. If the
|
||||
commit id matches on all sectors in a journal section, then it is
|
||||
assumed that the section was written correctly. If the commit id
|
||||
doesn't match, the section was written partially and it should not
|
||||
be replayed.
|
||||
* one or more runs of interleaved tags and data. Each run contains:
|
||||
|
||||
* one or more runs of interleaved tags and data.
|
||||
Each run contains:
|
||||
|
||||
* tag area - it contains integrity tags. There is one tag for each
|
||||
sector in the data area
|
||||
* data area - it contains data sectors. The number of data sectors
|
|
@ -1,3 +1,4 @@
|
|||
=====
|
||||
dm-io
|
||||
=====
|
||||
|
||||
|
@ -7,7 +8,7 @@ version.
|
|||
|
||||
The user must set up an io_region structure to describe the desired location
|
||||
of the I/O. Each io_region indicates a block-device along with the starting
|
||||
sector and size of the region.
|
||||
sector and size of the region::
|
||||
|
||||
struct io_region {
|
||||
struct block_device *bdev;
|
||||
|
@ -19,7 +20,7 @@ Dm-io can read from one io_region or write to one or more io_regions. Writes
|
|||
to multiple regions are specified by an array of io_region structures.
|
||||
|
||||
The first I/O service type takes a list of memory pages as the data buffer for
|
||||
the I/O, along with an offset into the first page.
|
||||
the I/O, along with an offset into the first page::
|
||||
|
||||
struct page_list {
|
||||
struct page_list *next;
|
||||
|
@ -35,7 +36,7 @@ the I/O, along with an offset into the first page.
|
|||
|
||||
The second I/O service type takes an array of bio vectors as the data buffer
|
||||
for the I/O. This service can be handy if the caller has a pre-assembled bio,
|
||||
but wants to direct different portions of the bio to different devices.
|
||||
but wants to direct different portions of the bio to different devices::
|
||||
|
||||
int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
|
||||
int rw, struct bio_vec *bvec,
|
||||
|
@ -47,7 +48,7 @@ but wants to direct different portions of the bio to different devices.
|
|||
The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
|
||||
data buffer for the I/O. This service can be handy if the caller needs to do
|
||||
I/O to a large region but doesn't want to allocate a large number of individual
|
||||
memory pages.
|
||||
memory pages::
|
||||
|
||||
int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
|
||||
void *data, unsigned long *error_bits);
|
||||
|
@ -55,11 +56,11 @@ memory pages.
|
|||
void *data, io_notify_fn fn, void *context);
|
||||
|
||||
Callers of the asynchronous I/O services must include the name of a completion
|
||||
callback routine and a pointer to some context data for the I/O.
|
||||
callback routine and a pointer to some context data for the I/O::
|
||||
|
||||
typedef void (*io_notify_fn)(unsigned long error, void *context);
|
||||
|
||||
The "error" parameter in this callback, as well as the "*error" parameter in
|
||||
The "error" parameter in this callback, as well as the `*error` parameter in
|
||||
all of the synchronous versions, is a bitset (instead of a simple error value).
|
||||
In the case of an write-I/O to multiple regions, this bitset allows dm-io to
|
||||
indicate success or failure on each individual region.
|
||||
|
@ -72,4 +73,3 @@ always available in order to avoid unnecessary waiting while performing I/O.
|
|||
When the user is finished using the dm-io services, they should call
|
||||
dm_io_put() and specify the same number of pages that were given on the
|
||||
dm_io_get() call.
|
||||
|
|
@ -1,3 +1,4 @@
|
|||
=====================
|
||||
Device-Mapper Logging
|
||||
=====================
|
||||
The device-mapper logging code is used by some of the device-mapper
|
||||
|
@ -16,11 +17,13 @@ dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
|
|||
logging implementations are available and provide different
|
||||
capabilities. The list includes:
|
||||
|
||||
============== ==============================================================
|
||||
Type Files
|
||||
==== =====
|
||||
============== ==============================================================
|
||||
disk drivers/md/dm-log.c
|
||||
core drivers/md/dm-log.c
|
||||
userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
|
||||
============== ==============================================================
|
||||
|
||||
The "disk" log type
|
||||
-------------------
|
|
@ -1,3 +1,4 @@
|
|||
===============
|
||||
dm-queue-length
|
||||
===============
|
||||
|
||||
|
@ -6,12 +7,18 @@ which selects a path with the least number of in-flight I/Os.
|
|||
The path selector name is 'queue-length'.
|
||||
|
||||
Table parameters for each path: [<repeat_count>]
|
||||
|
||||
::
|
||||
|
||||
<repeat_count>: The number of I/Os to dispatch using the selected
|
||||
path before switching to the next path.
|
||||
If not given, internal default is used. To check
|
||||
the default value, see the activated table.
|
||||
|
||||
Status for each path: <status> <fail-count> <in-flight>
|
||||
|
||||
::
|
||||
|
||||
<status>: 'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>: The number of path failures.
|
||||
<in-flight>: The number of in-flight I/Os on the path.
|
||||
|
@ -29,11 +36,13 @@ Examples
|
|||
========
|
||||
In case that 2 paths (sda and sdb) are used with repeat_count == 128.
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
|
||||
::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
|
|
@ -1,3 +1,4 @@
|
|||
=======
|
||||
dm-raid
|
||||
=======
|
||||
|
||||
|
@ -8,49 +9,66 @@ interface.
|
|||
|
||||
Mapping Table Interface
|
||||
-----------------------
|
||||
The target is named "raid" and it accepts the following parameters:
|
||||
The target is named "raid" and it accepts the following parameters::
|
||||
|
||||
<raid_type> <#raid_params> <raid_params> \
|
||||
<#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
|
||||
|
||||
<raid_type>:
|
||||
|
||||
============= ===============================================================
|
||||
raid0 RAID0 striping (no resilience)
|
||||
raid1 RAID1 mirroring
|
||||
raid4 RAID4 with dedicated last parity disk
|
||||
raid5_n RAID5 with dedicated last parity disk supporting takeover
|
||||
Same as raid4
|
||||
-Transitory layout
|
||||
|
||||
- Transitory layout
|
||||
raid5_la RAID5 left asymmetric
|
||||
|
||||
- rotating parity 0 with data continuation
|
||||
raid5_ra RAID5 right asymmetric
|
||||
|
||||
- rotating parity N with data continuation
|
||||
raid5_ls RAID5 left symmetric
|
||||
|
||||
- rotating parity 0 with data restart
|
||||
raid5_rs RAID5 right symmetric
|
||||
|
||||
- rotating parity N with data restart
|
||||
raid6_zr RAID6 zero restart
|
||||
|
||||
- rotating parity zero (left-to-right) with data restart
|
||||
raid6_nr RAID6 N restart
|
||||
|
||||
- rotating parity N (right-to-left) with data restart
|
||||
raid6_nc RAID6 N continue
|
||||
|
||||
- rotating parity N (right-to-left) with data continuation
|
||||
raid6_n_6 RAID6 with dedicate parity disks
|
||||
|
||||
- parity and Q-syndrome on the last 2 disks;
|
||||
layout for takeover from/to raid4/raid5_n
|
||||
raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_la from/to raid6
|
||||
raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_ra from/to raid6
|
||||
raid6_ls_6 Same as "raid5_ls" dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_ls from/to raid6
|
||||
raid6_rs_6 Same as "raid5_rs" dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_rs from/to raid6
|
||||
raid10 Various RAID10 inspired algorithms chosen by additional params
|
||||
(see raid10_format and raid10_copies below)
|
||||
|
||||
- RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
|
||||
- RAID1E: Integrated Adjacent Stripe Mirroring
|
||||
- RAID1E: Integrated Offset Stripe Mirroring
|
||||
- and other similar RAID10 variants
|
||||
- and other similar RAID10 variants
|
||||
============= ===============================================================
|
||||
|
||||
Reference: Chapter 4 of
|
||||
http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
|
||||
|
@ -58,33 +76,41 @@ The target is named "raid" and it accepts the following parameters:
|
|||
<#raid_params>: The number of parameters that follow.
|
||||
|
||||
<raid_params> consists of
|
||||
|
||||
Mandatory parameters:
|
||||
<chunk_size>: Chunk size in sectors. This parameter is often known as
|
||||
<chunk_size>:
|
||||
Chunk size in sectors. This parameter is often known as
|
||||
"stripe size". It is the only mandatory parameter and
|
||||
is placed first.
|
||||
|
||||
followed by optional parameters (in any order):
|
||||
[sync|nosync] Force or prevent RAID initialization.
|
||||
[sync|nosync]
|
||||
Force or prevent RAID initialization.
|
||||
|
||||
[rebuild <idx>] Rebuild drive number 'idx' (first drive is 0).
|
||||
[rebuild <idx>]
|
||||
Rebuild drive number 'idx' (first drive is 0).
|
||||
|
||||
[daemon_sleep <ms>]
|
||||
Interval between runs of the bitmap daemon that
|
||||
clear bits. A longer interval means less bitmap I/O but
|
||||
resyncing after a failure is likely to take longer.
|
||||
|
||||
[min_recovery_rate <kB/sec/disk>] Throttle RAID initialization
|
||||
[max_recovery_rate <kB/sec/disk>] Throttle RAID initialization
|
||||
[write_mostly <idx>] Mark drive index 'idx' write-mostly.
|
||||
[max_write_behind <sectors>] See '--write-behind=' (man mdadm)
|
||||
[stripe_cache <sectors>] Stripe cache size (RAID 4/5/6 only)
|
||||
[min_recovery_rate <kB/sec/disk>]
|
||||
Throttle RAID initialization
|
||||
[max_recovery_rate <kB/sec/disk>]
|
||||
Throttle RAID initialization
|
||||
[write_mostly <idx>]
|
||||
Mark drive index 'idx' write-mostly.
|
||||
[max_write_behind <sectors>]
|
||||
See '--write-behind=' (man mdadm)
|
||||
[stripe_cache <sectors>]
|
||||
Stripe cache size (RAID 4/5/6 only)
|
||||
[region_size <sectors>]
|
||||
The region_size multiplied by the number of regions is the
|
||||
logical size of the array. The bitmap records the device
|
||||
synchronisation state for each region.
|
||||
|
||||
[raid10_copies <# copies>]
|
||||
[raid10_format <near|far|offset>]
|
||||
[raid10_copies <# copies>], [raid10_format <near|far|offset>]
|
||||
These two options are used to alter the default layout of
|
||||
a RAID10 configuration. The number of copies is can be
|
||||
specified, but the default is 2. There are also three
|
||||
|
@ -93,13 +119,17 @@ The target is named "raid" and it accepts the following parameters:
|
|||
respect to mirroring. If these options are left unspecified,
|
||||
or 'raid10_copies 2' and/or 'raid10_format near' are given,
|
||||
then the layouts for 2, 3 and 4 devices are:
|
||||
|
||||
======== ========== ==============
|
||||
2 drives 3 drives 4 drives
|
||||
-------- ---------- --------------
|
||||
======== ========== ==============
|
||||
A1 A1 A1 A1 A2 A1 A1 A2 A2
|
||||
A2 A2 A2 A3 A3 A3 A3 A4 A4
|
||||
A3 A3 A4 A4 A5 A5 A5 A6 A6
|
||||
A4 A4 A5 A6 A6 A7 A7 A8 A8
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ========== ==============
|
||||
|
||||
The 2-device layout is equivalent 2-way RAID1. The 4-device
|
||||
layout is what a traditional RAID10 would look like. The
|
||||
3-device layout is what might be called a 'RAID1E - Integrated
|
||||
|
@ -107,8 +137,10 @@ The target is named "raid" and it accepts the following parameters:
|
|||
|
||||
If 'raid10_copies 2' and 'raid10_format far', then the layouts
|
||||
for 2, 3 and 4 devices are:
|
||||
|
||||
======== ============ ===================
|
||||
2 drives 3 drives 4 drives
|
||||
-------- -------------- --------------------
|
||||
======== ============ ===================
|
||||
A1 A2 A1 A2 A3 A1 A2 A3 A4
|
||||
A3 A4 A4 A5 A6 A5 A6 A7 A8
|
||||
A5 A6 A7 A8 A9 A9 A10 A11 A12
|
||||
|
@ -117,11 +149,14 @@ The target is named "raid" and it accepts the following parameters:
|
|||
A4 A3 A6 A4 A5 A6 A5 A8 A7
|
||||
A6 A5 A9 A7 A8 A10 A9 A12 A11
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ============ ===================
|
||||
|
||||
If 'raid10_copies 2' and 'raid10_format offset', then the
|
||||
layouts for 2, 3 and 4 devices are:
|
||||
|
||||
======== ========== ================
|
||||
2 drives 3 drives 4 drives
|
||||
-------- ------------ -----------------
|
||||
======== ========== ================
|
||||
A1 A2 A1 A2 A3 A1 A2 A3 A4
|
||||
A2 A1 A3 A1 A2 A2 A1 A4 A3
|
||||
A3 A4 A4 A5 A6 A5 A6 A7 A8
|
||||
|
@ -129,6 +164,8 @@ The target is named "raid" and it accepts the following parameters:
|
|||
A5 A6 A7 A8 A9 A9 A10 A11 A12
|
||||
A6 A5 A9 A7 A8 A10 A9 A12 A11
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ========== ================
|
||||
|
||||
Here we see layouts closely akin to 'RAID1E - Integrated
|
||||
Offset Stripe Mirroring'.
|
||||
|
||||
|
@ -190,22 +227,25 @@ The target is named "raid" and it accepts the following parameters:
|
|||
|
||||
Example Tables
|
||||
--------------
|
||||
# RAID4 - 4 data drives, 1 parity (no metadata devices)
|
||||
# No metadata devices specified to hold superblock/bitmap info
|
||||
# Chunk size of 1MiB
|
||||
# (Lines separated for easy reading)
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 1 2048 \
|
||||
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
|
||||
::
|
||||
|
||||
# RAID4 - 4 data drives, 1 parity (with metadata devices)
|
||||
# Chunk size of 1MiB, force RAID initialization,
|
||||
# min recovery rate at 20 kiB/sec/disk
|
||||
# RAID4 - 4 data drives, 1 parity (no metadata devices)
|
||||
# No metadata devices specified to hold superblock/bitmap info
|
||||
# Chunk size of 1MiB
|
||||
# (Lines separated for easy reading)
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 4 2048 sync min_recovery_rate 20 \
|
||||
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
|
||||
0 1960893648 raid \
|
||||
raid4 1 2048 \
|
||||
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
|
||||
|
||||
# RAID4 - 4 data drives, 1 parity (with metadata devices)
|
||||
# Chunk size of 1MiB, force RAID initialization,
|
||||
# min recovery rate at 20 kiB/sec/disk
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 4 2048 sync min_recovery_rate 20 \
|
||||
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
|
||||
|
||||
|
||||
Status Output
|
||||
|
@ -219,41 +259,58 @@ Arguments that can be repeated are ordered by value.
|
|||
|
||||
'dmsetup status' yields information on the state and health of the array.
|
||||
The output is as follows (normally a single line, but expanded here for
|
||||
clarity):
|
||||
1: <s> <l> raid \
|
||||
2: <raid_type> <#devices> <health_chars> \
|
||||
3: <sync_ratio> <sync_action> <mismatch_cnt>
|
||||
clarity)::
|
||||
|
||||
1: <s> <l> raid \
|
||||
2: <raid_type> <#devices> <health_chars> \
|
||||
3: <sync_ratio> <sync_action> <mismatch_cnt>
|
||||
|
||||
Line 1 is the standard output produced by device-mapper.
|
||||
Line 2 & 3 are produced by the raid target and are best explained by example:
|
||||
|
||||
Line 2 & 3 are produced by the raid target and are best explained by example::
|
||||
|
||||
0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
|
||||
|
||||
Here we can see the RAID type is raid4, there are 5 devices - all of
|
||||
which are 'A'live, and the array is 2/490221568 complete with its initial
|
||||
recovery. Here is a fuller description of the individual fields:
|
||||
|
||||
=============== =========================================================
|
||||
<raid_type> Same as the <raid_type> used to create the array.
|
||||
<health_chars> One char for each device, indicating: 'A' = alive and
|
||||
in-sync, 'a' = alive but not in-sync, 'D' = dead/failed.
|
||||
<health_chars> One char for each device, indicating:
|
||||
|
||||
- 'A' = alive and in-sync
|
||||
- 'a' = alive but not in-sync
|
||||
- 'D' = dead/failed.
|
||||
<sync_ratio> The ratio indicating how much of the array has undergone
|
||||
the process described by 'sync_action'. If the
|
||||
'sync_action' is "check" or "repair", then the process
|
||||
of "resync" or "recover" can be considered complete.
|
||||
<sync_action> One of the following possible states:
|
||||
idle - No synchronization action is being performed.
|
||||
frozen - The current action has been halted.
|
||||
resync - Array is undergoing its initial synchronization
|
||||
|
||||
idle
|
||||
- No synchronization action is being performed.
|
||||
frozen
|
||||
- The current action has been halted.
|
||||
resync
|
||||
- Array is undergoing its initial synchronization
|
||||
or is resynchronizing after an unclean shutdown
|
||||
(possibly aided by a bitmap).
|
||||
recover - A device in the array is being rebuilt or
|
||||
recover
|
||||
- A device in the array is being rebuilt or
|
||||
replaced.
|
||||
check - A user-initiated full check of the array is
|
||||
check
|
||||
- A user-initiated full check of the array is
|
||||
being performed. All blocks are read and
|
||||
checked for consistency. The number of
|
||||
discrepancies found are recorded in
|
||||
<mismatch_cnt>. No changes are made to the
|
||||
array by this action.
|
||||
repair - The same as "check", but discrepancies are
|
||||
repair
|
||||
- The same as "check", but discrepancies are
|
||||
corrected.
|
||||
reshape - The array is undergoing a reshape.
|
||||
reshape
|
||||
- The array is undergoing a reshape.
|
||||
<mismatch_cnt> The number of discrepancies found between mirror copies
|
||||
in RAID1/10 or wrong parity values found in RAID4/5/6.
|
||||
This value is valid only after a "check" of the array
|
||||
|
@ -261,10 +318,11 @@ recovery. Here is a fuller description of the individual fields:
|
|||
<data_offset> The current data offset to the start of the user data on
|
||||
each component device of a raid set (see the respective
|
||||
raid parameter to support out-of-place reshaping).
|
||||
<journal_char> 'A' - active write-through journal device.
|
||||
'a' - active write-back journal device.
|
||||
'D' - dead journal device.
|
||||
'-' - no journal device.
|
||||
<journal_char> - 'A' - active write-through journal device.
|
||||
- 'a' - active write-back journal device.
|
||||
- 'D' - dead journal device.
|
||||
- '-' - no journal device.
|
||||
=============== =========================================================
|
||||
|
||||
|
||||
Message Interface
|
||||
|
@ -272,12 +330,15 @@ Message Interface
|
|||
The dm-raid target will accept certain actions through the 'message' interface.
|
||||
('man dmsetup' for more information on the message interface.) These actions
|
||||
include:
|
||||
"idle" - Halt the current sync action.
|
||||
"frozen" - Freeze the current sync action.
|
||||
"resync" - Initiate/continue a resync.
|
||||
"recover"- Initiate/continue a recover process.
|
||||
"check" - Initiate a check (i.e. a "scrub") of the array.
|
||||
"repair" - Initiate a repair of the array.
|
||||
|
||||
========= ================================================
|
||||
"idle" Halt the current sync action.
|
||||
"frozen" Freeze the current sync action.
|
||||
"resync" Initiate/continue a resync.
|
||||
"recover" Initiate/continue a recover process.
|
||||
"check" Initiate a check (i.e. a "scrub") of the array.
|
||||
"repair" Initiate a repair of the array.
|
||||
========= ================================================
|
||||
|
||||
|
||||
Discard Support
|
||||
|
@ -307,48 +368,52 @@ increasingly whitelisted in the kernel and can thus be trusted.
|
|||
|
||||
For trusted devices, the following dm-raid module parameter can be set
|
||||
to safely enable discard support for RAID 4/5/6:
|
||||
|
||||
'devices_handle_discards_safely'
|
||||
|
||||
|
||||
Version History
|
||||
---------------
|
||||
1.0.0 Initial version. Support for RAID 4/5/6
|
||||
1.1.0 Added support for RAID 1
|
||||
1.2.0 Handle creation of arrays that contain failed devices.
|
||||
1.3.0 Added support for RAID 10
|
||||
1.3.1 Allow device replacement/rebuild for RAID 10
|
||||
1.3.2 Fix/improve redundancy checking for RAID10
|
||||
1.4.0 Non-functional change. Removes arg from mapping function.
|
||||
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
|
||||
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
||||
1.5.0 Add message interface to allow manipulation of the sync_action.
|
||||
|
||||
::
|
||||
|
||||
1.0.0 Initial version. Support for RAID 4/5/6
|
||||
1.1.0 Added support for RAID 1
|
||||
1.2.0 Handle creation of arrays that contain failed devices.
|
||||
1.3.0 Added support for RAID 10
|
||||
1.3.1 Allow device replacement/rebuild for RAID 10
|
||||
1.3.2 Fix/improve redundancy checking for RAID10
|
||||
1.4.0 Non-functional change. Removes arg from mapping function.
|
||||
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
|
||||
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
||||
1.5.0 Add message interface to allow manipulation of the sync_action.
|
||||
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
|
||||
1.5.1 Add ability to restore transiently failed devices on resume.
|
||||
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
|
||||
1.6.0 Add discard support (and devices_handle_discard_safely module param).
|
||||
1.7.0 Add support for MD RAID0 mappings.
|
||||
1.8.0 Explicitly check for compatible flags in the superblock metadata
|
||||
1.5.1 Add ability to restore transiently failed devices on resume.
|
||||
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
|
||||
1.6.0 Add discard support (and devices_handle_discard_safely module param).
|
||||
1.7.0 Add support for MD RAID0 mappings.
|
||||
1.8.0 Explicitly check for compatible flags in the superblock metadata
|
||||
and reject to start the raid set if any are set by a newer
|
||||
target version, thus avoiding data corruption on a raid set
|
||||
with a reshape in progress.
|
||||
1.9.0 Add support for RAID level takeover/reshape/region size
|
||||
1.9.0 Add support for RAID level takeover/reshape/region size
|
||||
and set size reduction.
|
||||
1.9.1 Fix activation of existing RAID 4/10 mapped devices
|
||||
1.9.2 Don't emit '- -' on the status table line in case the constructor
|
||||
1.9.1 Fix activation of existing RAID 4/10 mapped devices
|
||||
1.9.2 Don't emit '- -' on the status table line in case the constructor
|
||||
fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
|
||||
'D' on the status line. If '- -' is passed into the constructor, emit
|
||||
'- -' on the table line and '-' as the status line health character.
|
||||
1.10.0 Add support for raid4/5/6 journal device
|
||||
1.10.1 Fix data corruption on reshape request
|
||||
1.11.0 Fix table line argument order
|
||||
1.10.0 Add support for raid4/5/6 journal device
|
||||
1.10.1 Fix data corruption on reshape request
|
||||
1.11.0 Fix table line argument order
|
||||
(wrong raid10_copies/raid10_format sequence)
|
||||
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
|
||||
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
|
||||
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
|
||||
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an
|
||||
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
|
||||
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
|
||||
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
|
||||
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an
|
||||
state races.
|
||||
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
|
||||
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
|
||||
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
|
||||
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
|
||||
deadlock/potential data corruption. Update superblock when
|
||||
specific devices are requested via rebuild. Fix RAID leg
|
||||
rebuild errors.
|
|
@ -1,3 +1,4 @@
|
|||
===============
|
||||
dm-service-time
|
||||
===============
|
||||
|
||||
|
@ -12,25 +13,34 @@ in a path-group, and it can be specified as a table argument.
|
|||
|
||||
The path selector name is 'service-time'.
|
||||
|
||||
Table parameters for each path: [<repeat_count> [<relative_throughput>]]
|
||||
<repeat_count>: The number of I/Os to dispatch using the selected
|
||||
Table parameters for each path:
|
||||
|
||||
[<repeat_count> [<relative_throughput>]]
|
||||
<repeat_count>:
|
||||
The number of I/Os to dispatch using the selected
|
||||
path before switching to the next path.
|
||||
If not given, internal default is used. To check
|
||||
the default value, see the activated table.
|
||||
<relative_throughput>: The relative throughput value of the path
|
||||
<relative_throughput>:
|
||||
The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
The valid range is 0-100.
|
||||
If not given, minimum value '1' is used.
|
||||
If '0' is given, the path isn't selected while
|
||||
other paths having a positive value are available.
|
||||
|
||||
Status for each path: <status> <fail-count> <in-flight-size> \
|
||||
<relative_throughput>
|
||||
<status>: 'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>: The number of path failures.
|
||||
<in-flight-size>: The size of in-flight I/Os on the path.
|
||||
<relative_throughput>: The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
Status for each path:
|
||||
|
||||
<status> <fail-count> <in-flight-size> <relative_throughput>
|
||||
<status>:
|
||||
'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>:
|
||||
The number of path failures.
|
||||
<in-flight-size>:
|
||||
The size of in-flight I/Os on the path.
|
||||
<relative_throughput>:
|
||||
The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
|
||||
|
||||
Algorithm
|
||||
|
@ -39,7 +49,7 @@ Algorithm
|
|||
dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
|
||||
dispatched and subtracts when completed.
|
||||
Basically, dm-service-time selects a path having minimum service time
|
||||
which is calculated by:
|
||||
which is calculated by::
|
||||
|
||||
('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
|
||||
|
||||
|
@ -67,25 +77,25 @@ Examples
|
|||
========
|
||||
In case that 2 paths (sda and sdb) are used with repeat_count == 128
|
||||
and sda has an average throughput 1GB/s and sdb has 4GB/s,
|
||||
'relative_throughput' value may be '1' for sda and '4' for sdb.
|
||||
'relative_throughput' value may be '1' for sda and '4' for sdb::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
|
||||
|
||||
|
||||
Or '2' for sda and '8' for sdb would be also true.
|
||||
Or '2' for sda and '8' for sdb would be also true::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
|
|
@ -0,0 +1,110 @@
|
|||
====================
|
||||
device-mapper uevent
|
||||
====================
|
||||
|
||||
The device-mapper uevent code adds the capability to device-mapper to create
|
||||
and send kobject uevents (uevents). Previously device-mapper events were only
|
||||
available through the ioctl interface. The advantage of the uevents interface
|
||||
is the event contains environment attributes providing increased context for
|
||||
the event avoiding the need to query the state of the device-mapper device after
|
||||
the event is received.
|
||||
|
||||
There are two functions currently for device-mapper events. The first function
|
||||
listed creates the event and the second function sends the event(s)::
|
||||
|
||||
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
|
||||
const char *path, unsigned nr_valid_paths)
|
||||
|
||||
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
|
||||
|
||||
|
||||
The variables added to the uevent environment are:
|
||||
|
||||
Variable Name: DM_TARGET
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description:
|
||||
:Value: Name of device-mapper target that generated the event.
|
||||
|
||||
Variable Name: DM_ACTION
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description:
|
||||
:Value: Device-mapper specific action that caused the uevent action.
|
||||
PATH_FAILED - A path has failed;
|
||||
PATH_REINSTATED - A path has been reinstated.
|
||||
|
||||
Variable Name: DM_SEQNUM
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: unsigned integer
|
||||
:Description: A sequence number for this specific device-mapper device.
|
||||
:Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_PATH
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: Major and minor number of the path device pertaining to this
|
||||
event.
|
||||
:Value: Path name in the form of "Major:Minor"
|
||||
|
||||
Variable Name: DM_NR_VALID_PATHS
|
||||
--------------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: unsigned integer
|
||||
:Description:
|
||||
:Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_NAME
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: Name of the device-mapper device.
|
||||
:Value: Name
|
||||
|
||||
Variable Name: DM_UUID
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: UUID of the device-mapper device.
|
||||
:Value: UUID. (Empty string if there isn't one.)
|
||||
|
||||
An example of the uevents generated as captured by udevmonitor is shown
|
||||
below
|
||||
|
||||
1.) Path failure::
|
||||
|
||||
UEVENT[1192521009.711215] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_FAILED
|
||||
DM_SEQNUM=1
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=0
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1130
|
||||
|
||||
2.) Path reinstate::
|
||||
|
||||
UEVENT[1192521132.989927] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_REINSTATED
|
||||
DM_SEQNUM=2
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=1
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1131
|
|
@ -1,97 +0,0 @@
|
|||
The device-mapper uevent code adds the capability to device-mapper to create
|
||||
and send kobject uevents (uevents). Previously device-mapper events were only
|
||||
available through the ioctl interface. The advantage of the uevents interface
|
||||
is the event contains environment attributes providing increased context for
|
||||
the event avoiding the need to query the state of the device-mapper device after
|
||||
the event is received.
|
||||
|
||||
There are two functions currently for device-mapper events. The first function
|
||||
listed creates the event and the second function sends the event(s).
|
||||
|
||||
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
|
||||
const char *path, unsigned nr_valid_paths)
|
||||
|
||||
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
|
||||
|
||||
|
||||
The variables added to the uevent environment are:
|
||||
|
||||
Variable Name: DM_TARGET
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description:
|
||||
Value: Name of device-mapper target that generated the event.
|
||||
|
||||
Variable Name: DM_ACTION
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description:
|
||||
Value: Device-mapper specific action that caused the uevent action.
|
||||
PATH_FAILED - A path has failed.
|
||||
PATH_REINSTATED - A path has been reinstated.
|
||||
|
||||
Variable Name: DM_SEQNUM
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: unsigned integer
|
||||
Description: A sequence number for this specific device-mapper device.
|
||||
Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_PATH
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description: Major and minor number of the path device pertaining to this
|
||||
event.
|
||||
Value: Path name in the form of "Major:Minor"
|
||||
|
||||
Variable Name: DM_NR_VALID_PATHS
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: unsigned integer
|
||||
Description:
|
||||
Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_NAME
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description: Name of the device-mapper device.
|
||||
Value: Name
|
||||
|
||||
Variable Name: DM_UUID
|
||||
Uevent Action(s): KOBJ_CHANGE
|
||||
Type: string
|
||||
Description: UUID of the device-mapper device.
|
||||
Value: UUID. (Empty string if there isn't one.)
|
||||
|
||||
An example of the uevents generated as captured by udevmonitor is shown
|
||||
below.
|
||||
|
||||
1.) Path failure.
|
||||
UEVENT[1192521009.711215] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_FAILED
|
||||
DM_SEQNUM=1
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=0
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1130
|
||||
|
||||
2.) Path reinstate.
|
||||
UEVENT[1192521132.989927] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_REINSTATED
|
||||
DM_SEQNUM=2
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=1
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1131
|
|
@ -1,3 +1,4 @@
|
|||
========
|
||||
dm-zoned
|
||||
========
|
||||
|
||||
|
@ -133,12 +134,13 @@ A zoned block device must first be formatted using the dmzadm tool. This
|
|||
will analyze the device zone configuration, determine where to place the
|
||||
metadata sets on the device and initialize the metadata sets.
|
||||
|
||||
Ex:
|
||||
Ex::
|
||||
|
||||
dmzadm --format /dev/sdxx
|
||||
dmzadm --format /dev/sdxx
|
||||
|
||||
For a formatted device, the target can be created normally with the
|
||||
dmsetup utility. The only parameter that dm-zoned requires is the
|
||||
underlying zoned block device name. Ex:
|
||||
underlying zoned block device name. Ex::
|
||||
|
||||
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
|
||||
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
|
||||
dmsetup create dmz-`basename ${dev}`
|
|
@ -1,3 +1,7 @@
|
|||
======
|
||||
dm-era
|
||||
======
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
|
@ -14,12 +18,14 @@ coherency after rolling back a vendor snapshot.
|
|||
Constructor
|
||||
===========
|
||||
|
||||
era <metadata dev> <origin dev> <block size>
|
||||
era <metadata dev> <origin dev> <block size>
|
||||
|
||||
metadata dev : fast device holding the persistent metadata
|
||||
origin dev : device holding data blocks that may change
|
||||
block size : block size of origin data device, granularity that is
|
||||
tracked by the target
|
||||
================ ======================================================
|
||||
metadata dev fast device holding the persistent metadata
|
||||
origin dev device holding data blocks that may change
|
||||
block size block size of origin data device, granularity that is
|
||||
tracked by the target
|
||||
================ ======================================================
|
||||
|
||||
Messages
|
||||
========
|
||||
|
@ -49,14 +55,16 @@ Status
|
|||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<current era> <held metadata root | '-'>
|
||||
|
||||
metadata block size : Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks : Number of metadata blocks used
|
||||
#total metadata blocks : Total number of metadata blocks
|
||||
current era : The current era
|
||||
held metadata root : The location, in blocks, of the metadata root
|
||||
that has been 'held' for userspace read
|
||||
access. '-' indicates there is no held root
|
||||
========================= ==============================================
|
||||
metadata block size Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks Number of metadata blocks used
|
||||
#total metadata blocks Total number of metadata blocks
|
||||
current era The current era
|
||||
held metadata root The location, in blocks, of the metadata root
|
||||
that has been 'held' for userspace read
|
||||
access. '-' indicates there is no held root
|
||||
========================= ==============================================
|
||||
|
||||
Detailed use case
|
||||
=================
|
||||
|
@ -88,7 +96,7 @@ Memory usage
|
|||
|
||||
The target uses a bitset to record writes in the current era. It also
|
||||
has a spare bitset ready for switching over to a new era. Other than
|
||||
that it uses a few 4k blocks for updating metadata.
|
||||
that it uses a few 4k blocks for updating metadata::
|
||||
|
||||
(4 * nr_blocks) bytes + buffers
|
||||
|
|
@ -0,0 +1,44 @@
|
|||
:orphan:
|
||||
|
||||
=============
|
||||
Device Mapper
|
||||
=============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
cache-policies
|
||||
cache
|
||||
delay
|
||||
dm-crypt
|
||||
dm-flakey
|
||||
dm-init
|
||||
dm-integrity
|
||||
dm-io
|
||||
dm-log
|
||||
dm-queue-length
|
||||
dm-raid
|
||||
dm-service-time
|
||||
dm-uevent
|
||||
dm-zoned
|
||||
era
|
||||
kcopyd
|
||||
linear
|
||||
log-writes
|
||||
persistent-data
|
||||
snapshot
|
||||
statistics
|
||||
striped
|
||||
switch
|
||||
thin-provisioning
|
||||
unstriped
|
||||
verity
|
||||
writecache
|
||||
zero
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
=======
|
||||
|
||||
* :ref:`genindex`
|
|
@ -1,3 +1,4 @@
|
|||
======
|
||||
kcopyd
|
||||
======
|
||||
|
||||
|
@ -7,7 +8,7 @@ notification. It is used by dm-snapshot and dm-mirror.
|
|||
|
||||
Users of kcopyd must first create a client and indicate how many memory pages
|
||||
to set aside for their copy jobs. This is done with a call to
|
||||
kcopyd_client_create().
|
||||
kcopyd_client_create()::
|
||||
|
||||
int kcopyd_client_create(unsigned int num_pages,
|
||||
struct kcopyd_client **result);
|
||||
|
@ -16,7 +17,7 @@ To start a copy job, the user must set up io_region structures to describe
|
|||
the source and destinations of the copy. Each io_region indicates a
|
||||
block-device along with the starting sector and size of the region. The source
|
||||
of the copy is given as one io_region structure, and the destinations of the
|
||||
copy are given as an array of io_region structures.
|
||||
copy are given as an array of io_region structures::
|
||||
|
||||
struct io_region {
|
||||
struct block_device *bdev;
|
||||
|
@ -26,7 +27,7 @@ copy are given as an array of io_region structures.
|
|||
|
||||
To start the copy, the user calls kcopyd_copy(), passing in the client
|
||||
pointer, pointers to the source and destination io_regions, the name of a
|
||||
completion callback routine, and a pointer to some context data for the copy.
|
||||
completion callback routine, and a pointer to some context data for the copy::
|
||||
|
||||
int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
|
||||
unsigned int num_dests, struct io_region *dests,
|
||||
|
@ -41,7 +42,6 @@ write error occurred during the copy.
|
|||
|
||||
When a user is done with all their copy jobs, they should call
|
||||
kcopyd_client_destroy() to delete the kcopyd client, which will release the
|
||||
associated memory pages.
|
||||
associated memory pages::
|
||||
|
||||
void kcopyd_client_destroy(struct kcopyd_client *kc);
|
||||
|
|
@ -0,0 +1,63 @@
|
|||
=========
|
||||
dm-linear
|
||||
=========
|
||||
|
||||
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
|
||||
device onto a linear range of another device. This is the basic building
|
||||
block of logical volume managers.
|
||||
|
||||
Parameters: <dev path> <offset>
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create an identity mapping for a device
|
||||
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Join 2 devices together
|
||||
size1=`blockdev --getsz $1`
|
||||
size2=`blockdev --getsz $2`
|
||||
echo "0 $size1 linear $1 0
|
||||
$size1 $size2 linear $2 0" | dmsetup create joined
|
||||
|
||||
::
|
||||
|
||||
#!/usr/bin/perl -w
|
||||
# Split a device into 4M chunks and then join them together in reverse order.
|
||||
|
||||
my $name = "reverse";
|
||||
my $extent_size = 4 * 1024 * 2;
|
||||
my $dev = $ARGV[0];
|
||||
my $table = "";
|
||||
my $count = 0;
|
||||
|
||||
if (!defined($dev)) {
|
||||
die("Please specify a device.\n");
|
||||
}
|
||||
|
||||
my $dev_size = `blockdev --getsz $dev`;
|
||||
my $extents = int($dev_size / $extent_size) -
|
||||
(($dev_size % $extent_size) ? 1 : 0);
|
||||
|
||||
while ($extents > 0) {
|
||||
my $this_start = $count * $extent_size;
|
||||
$extents--;
|
||||
$count++;
|
||||
my $this_offset = $extents * $extent_size;
|
||||
|
||||
$table .= "$this_start $extent_size linear $dev $this_offset\n";
|
||||
}
|
||||
|
||||
`echo \"$table\" | dmsetup create $name`;
|
|
@ -1,61 +0,0 @@
|
|||
dm-linear
|
||||
=========
|
||||
|
||||
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
|
||||
device onto a linear range of another device. This is the basic building
|
||||
block of logical volume managers.
|
||||
|
||||
Parameters: <dev path> <offset>
|
||||
<dev path>: Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>: Starting sector within the device.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Create an identity mapping for a device
|
||||
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
|
||||
]]
|
||||
|
||||
|
||||
[[
|
||||
#!/bin/sh
|
||||
# Join 2 devices together
|
||||
size1=`blockdev --getsz $1`
|
||||
size2=`blockdev --getsz $2`
|
||||
echo "0 $size1 linear $1 0
|
||||
$size1 $size2 linear $2 0" | dmsetup create joined
|
||||
]]
|
||||
|
||||
|
||||
[[
|
||||
#!/usr/bin/perl -w
|
||||
# Split a device into 4M chunks and then join them together in reverse order.
|
||||
|
||||
my $name = "reverse";
|
||||
my $extent_size = 4 * 1024 * 2;
|
||||
my $dev = $ARGV[0];
|
||||
my $table = "";
|
||||
my $count = 0;
|
||||
|
||||
if (!defined($dev)) {
|
||||
die("Please specify a device.\n");
|
||||
}
|
||||
|
||||
my $dev_size = `blockdev --getsz $dev`;
|
||||
my $extents = int($dev_size / $extent_size) -
|
||||
(($dev_size % $extent_size) ? 1 : 0);
|
||||
|
||||
while ($extents > 0) {
|
||||
my $this_start = $count * $extent_size;
|
||||
$extents--;
|
||||
$count++;
|
||||
my $this_offset = $extents * $extent_size;
|
||||
|
||||
$table .= "$this_start $extent_size linear $dev $this_offset\n";
|
||||
}
|
||||
|
||||
`echo \"$table\" | dmsetup create $name`;
|
||||
]]
|
|
@ -1,3 +1,4 @@
|
|||
=============
|
||||
dm-log-writes
|
||||
=============
|
||||
|
||||
|
@ -25,11 +26,11 @@ completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
|
|||
simulate the worst case scenario with regard to power failures. Consider the
|
||||
following example (W means write, C means complete):
|
||||
|
||||
W1,W2,W3,C3,C2,Wflush,C1,Cflush
|
||||
W1,W2,W3,C3,C2,Wflush,C1,Cflush
|
||||
|
||||
The log would show the following
|
||||
The log would show the following:
|
||||
|
||||
W3,W2,flush,W1....
|
||||
W3,W2,flush,W1....
|
||||
|
||||
Again this is to simulate what is actually on disk, this allows us to detect
|
||||
cases where a power failure at a particular point in time would create an
|
||||
|
@ -42,11 +43,11 @@ Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
|
|||
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
|
||||
request. Consider the following example:
|
||||
|
||||
WRITE block 1, DISCARD block 1, FLUSH
|
||||
WRITE block 1, DISCARD block 1, FLUSH
|
||||
|
||||
If we logged DISCARD when it completed, the replay would look like this
|
||||
If we logged DISCARD when it completed, the replay would look like this:
|
||||
|
||||
DISCARD 1, WRITE 1, FLUSH
|
||||
DISCARD 1, WRITE 1, FLUSH
|
||||
|
||||
which isn't quite what happened and wouldn't be caught during the log replay.
|
||||
|
||||
|
@ -57,15 +58,19 @@ i) Constructor
|
|||
|
||||
log-writes <dev_path> <log_dev_path>
|
||||
|
||||
dev_path : Device that all of the IO will go to normally.
|
||||
log_dev_path : Device where the log entries are written to.
|
||||
============= ==============================================
|
||||
dev_path Device that all of the IO will go to normally.
|
||||
log_dev_path Device where the log entries are written to.
|
||||
============= ==============================================
|
||||
|
||||
ii) Status
|
||||
|
||||
<#logged entries> <highest allocated sector>
|
||||
|
||||
#logged entries : Number of logged entries
|
||||
highest allocated sector : Highest allocated sector
|
||||
=========================== ========================
|
||||
#logged entries Number of logged entries
|
||||
highest allocated sector Highest allocated sector
|
||||
=========================== ========================
|
||||
|
||||
iii) Messages
|
||||
|
||||
|
@ -75,15 +80,15 @@ iii) Messages
|
|||
For example say you want to fsck a file system after every
|
||||
write, but first you need to replay up to the mkfs to make sure
|
||||
we're fsck'ing something reasonable, you would do something like
|
||||
this:
|
||||
this::
|
||||
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
<run test>
|
||||
|
||||
This would allow you to replay the log up to the mkfs mark and
|
||||
then replay from that point on doing the fsck check in the
|
||||
interval that you want.
|
||||
This would allow you to replay the log up to the mkfs mark and
|
||||
then replay from that point on doing the fsck check in the
|
||||
interval that you want.
|
||||
|
||||
Every log has a mark at the end labeled "dm-log-writes-end".
|
||||
|
||||
|
@ -97,42 +102,42 @@ Example usage
|
|||
=============
|
||||
|
||||
Say you want to test fsync on your file system. You would do something like
|
||||
this:
|
||||
this::
|
||||
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<some test that does fsync at the end>
|
||||
dmsetup message log 0 mark fsync
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
umount /mnt/btrfs-test
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<some test that does fsync at the end>
|
||||
dmsetup message log 0 mark fsync
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
umount /mnt/btrfs-test
|
||||
|
||||
dmsetup remove log
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
|
||||
mount /dev/sdb /mnt/btrfs-test
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
<verify md5sum's are correct>
|
||||
dmsetup remove log
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
|
||||
mount /dev/sdb /mnt/btrfs-test
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
<verify md5sum's are correct>
|
||||
|
||||
Another option is to do a complicated file system operation and verify the file
|
||||
system is consistent during the entire operation. You could do this with:
|
||||
Another option is to do a complicated file system operation and verify the file
|
||||
system is consistent during the entire operation. You could do this with:
|
||||
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<fsstress to dirty the fs>
|
||||
btrfs filesystem balance /mnt/btrfs-test
|
||||
umount /mnt/btrfs-test
|
||||
dmsetup remove log
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<fsstress to dirty the fs>
|
||||
btrfs filesystem balance /mnt/btrfs-test
|
||||
umount /mnt/btrfs-test
|
||||
dmsetup remove log
|
||||
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
|
||||
btrfsck /dev/sdb
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
|
||||
btrfsck /dev/sdb
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
|
||||
--fsck "btrfsck /dev/sdb" --check fua
|
||||
|
||||
And that will replay the log until it sees a FUA request, run the fsck command
|
|
@ -1,3 +1,7 @@
|
|||
===============
|
||||
Persistent data
|
||||
===============
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
|
@ -1,15 +1,16 @@
|
|||
==============================
|
||||
Device-mapper snapshot support
|
||||
==============================
|
||||
|
||||
Device-mapper allows you, without massive data copying:
|
||||
|
||||
*) To create snapshots of any block device i.e. mountable, saved states of
|
||||
the block device which are also writable without interfering with the
|
||||
original content;
|
||||
*) To create device "forks", i.e. multiple different versions of the
|
||||
same data stream.
|
||||
*) To merge a snapshot of a block device back into the snapshot's origin
|
||||
device.
|
||||
- To create snapshots of any block device i.e. mountable, saved states of
|
||||
the block device which are also writable without interfering with the
|
||||
original content;
|
||||
- To create device "forks", i.e. multiple different versions of the
|
||||
same data stream.
|
||||
- To merge a snapshot of a block device back into the snapshot's origin
|
||||
device.
|
||||
|
||||
In the first two cases, dm copies only the chunks of data that get
|
||||
changed and uses a separate copy-on-write (COW) block device for
|
||||
|
@ -22,7 +23,7 @@ the origin device.
|
|||
There are three dm targets available:
|
||||
snapshot, snapshot-origin, and snapshot-merge.
|
||||
|
||||
*) snapshot-origin <origin>
|
||||
- snapshot-origin <origin>
|
||||
|
||||
which will normally have one or more snapshots based on it.
|
||||
Reads will be mapped directly to the backing device. For each write, the
|
||||
|
@ -30,7 +31,7 @@ original data will be saved in the <COW device> of each snapshot to keep
|
|||
its visible content unchanged, at least until the <COW device> fills up.
|
||||
|
||||
|
||||
*) snapshot <origin> <COW device> <persistent?> <chunksize>
|
||||
- snapshot <origin> <COW device> <persistent?> <chunksize>
|
||||
|
||||
A snapshot of the <origin> block device is created. Changed chunks of
|
||||
<chunksize> sectors will be stored on the <COW device>. Writes will
|
||||
|
@ -83,25 +84,25 @@ When you create the first LVM2 snapshot of a volume, four dm devices are used:
|
|||
source volume), whose table is replaced by a "snapshot-origin" mapping
|
||||
from device #1.
|
||||
|
||||
A fixed naming scheme is used, so with the following commands:
|
||||
A fixed naming scheme is used, so with the following commands::
|
||||
|
||||
lvcreate -L 1G -n base volumeGroup
|
||||
lvcreate -L 100M --snapshot -n snap volumeGroup/base
|
||||
lvcreate -L 1G -n base volumeGroup
|
||||
lvcreate -L 100M --snapshot -n snap volumeGroup/base
|
||||
|
||||
we'll have this situation (with volumes in above order):
|
||||
we'll have this situation (with volumes in above order)::
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
|
||||
volumeGroup-base: 0 2097152 snapshot-origin 254:11
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
|
||||
volumeGroup-base: 0 2097152 snapshot-origin 254:11
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
|
||||
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
|
||||
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
|
||||
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
|
||||
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
|
||||
|
||||
|
||||
How snapshot-merge is used by LVM2
|
||||
|
@ -114,27 +115,28 @@ merging snapshot after it completes. The "snapshot" that hands over its
|
|||
COW device to the "snapshot-merge" is deactivated (unless using lvchange
|
||||
--refresh); but if it is left active it will simply return I/O errors.
|
||||
|
||||
A snapshot will merge into its origin with the following command:
|
||||
A snapshot will merge into its origin with the following command::
|
||||
|
||||
lvconvert --merge volumeGroup/snap
|
||||
lvconvert --merge volumeGroup/snap
|
||||
|
||||
we'll now have this situation:
|
||||
we'll now have this situation::
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
|
||||
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
|
||||
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
|
||||
|
||||
|
||||
How to determine when a merging is complete
|
||||
===========================================
|
||||
The snapshot-merge and snapshot status lines end with:
|
||||
|
||||
<sectors_allocated>/<total_sectors> <metadata_sectors>
|
||||
|
||||
Both <sectors_allocated> and <total_sectors> include both data and metadata.
|
||||
|
@ -142,35 +144,37 @@ During merging, the number of sectors allocated gets smaller and
|
|||
smaller. Merging has finished when the number of sectors holding data
|
||||
is zero, in other words <sectors_allocated> == <metadata_sectors>.
|
||||
|
||||
Here is a practical example (using a hybrid of lvm and dmsetup commands):
|
||||
Here is a practical example (using a hybrid of lvm and dmsetup commands)::
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
snap volumeGroup swi-a- 1.00g base 18.97
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
snap volumeGroup swi-a- 1.00g base 18.97
|
||||
|
||||
# dmsetup status volumeGroup-snap
|
||||
0 8388608 snapshot 397896/2097152 1560
|
||||
^^^^ metadata sectors
|
||||
# dmsetup status volumeGroup-snap
|
||||
0 8388608 snapshot 397896/2097152 1560
|
||||
^^^^ metadata sectors
|
||||
|
||||
# lvconvert --merge -b volumeGroup/snap
|
||||
Merging of volume snap started.
|
||||
# lvconvert --merge -b volumeGroup/snap
|
||||
Merging of volume snap started.
|
||||
|
||||
# lvs volumeGroup/snap
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup Owi-a- 4.00g 17.23
|
||||
# lvs volumeGroup/snap
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup Owi-a- 4.00g 17.23
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 281688/2097152 1104
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 281688/2097152 1104
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 180480/2097152 712
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 180480/2097152 712
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 16/2097152 16
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 16/2097152 16
|
||||
|
||||
Merging has finished.
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
::
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
|
@ -1,3 +1,4 @@
|
|||
=============
|
||||
DM statistics
|
||||
=============
|
||||
|
||||
|
@ -11,7 +12,7 @@ Individual statistics will be collected for each step-sized area within
|
|||
the range specified.
|
||||
|
||||
The I/O statistics counters for each step-sized area of a region are
|
||||
in the same format as /sys/block/*/stat or /proc/diskstats (see:
|
||||
in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
|
||||
Documentation/iostats.txt). But two extra counters (12 and 13) are
|
||||
provided: total time spent reading and writing. When the histogram
|
||||
argument is used, the 14th parameter is reported that represents the
|
||||
|
@ -32,40 +33,45 @@ on each other's data.
|
|||
The creation of DM statistics will allocate memory via kmalloc or
|
||||
fallback to using vmalloc space. At most, 1/4 of the overall system
|
||||
memory may be allocated by DM statistics. The admin can see how much
|
||||
memory is used by reading
|
||||
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
|
||||
memory is used by reading:
|
||||
|
||||
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
|
||||
|
||||
Messages
|
||||
========
|
||||
|
||||
@stats_create <range> <step>
|
||||
[<number_of_optional_arguments> <optional_arguments>...]
|
||||
[<program_id> [<aux_data>]]
|
||||
|
||||
@stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
|
||||
Create a new region and return the region_id.
|
||||
|
||||
<range>
|
||||
"-" - whole device
|
||||
"<start_sector>+<length>" - a range of <length> 512-byte sectors
|
||||
starting with <start_sector>.
|
||||
"-"
|
||||
whole device
|
||||
"<start_sector>+<length>"
|
||||
a range of <length> 512-byte sectors
|
||||
starting with <start_sector>.
|
||||
|
||||
<step>
|
||||
"<area_size>" - the range is subdivided into areas each containing
|
||||
<area_size> sectors.
|
||||
"/<number_of_areas>" - the range is subdivided into the specified
|
||||
number of areas.
|
||||
"<area_size>"
|
||||
the range is subdivided into areas each containing
|
||||
<area_size> sectors.
|
||||
"/<number_of_areas>"
|
||||
the range is subdivided into the specified
|
||||
number of areas.
|
||||
|
||||
<number_of_optional_arguments>
|
||||
The number of optional arguments
|
||||
|
||||
<optional_arguments>
|
||||
The following optional arguments are supported
|
||||
precise_timestamps - use precise timer with nanosecond resolution
|
||||
The following optional arguments are supported:
|
||||
|
||||
precise_timestamps
|
||||
use precise timer with nanosecond resolution
|
||||
instead of the "jiffies" variable. When this argument is
|
||||
used, the resulting times are in nanoseconds instead of
|
||||
milliseconds. Precise timestamps are a little bit slower
|
||||
to obtain than jiffies-based timestamps.
|
||||
histogram:n1,n2,n3,n4,... - collect histogram of latencies. The
|
||||
histogram:n1,n2,n3,n4,...
|
||||
collect histogram of latencies. The
|
||||
numbers n1, n2, etc are times that represent the boundaries
|
||||
of the histogram. If precise_timestamps is not used, the
|
||||
times are in milliseconds, otherwise they are in
|
||||
|
@ -96,21 +102,18 @@ Messages
|
|||
@stats_list message, but it doesn't use this value for anything.
|
||||
|
||||
@stats_delete <region_id>
|
||||
|
||||
Delete the region with the specified id.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
@stats_clear <region_id>
|
||||
|
||||
Clear all the counters except the in-flight i/o counters.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
@stats_list [<program_id>]
|
||||
|
||||
List all regions registered with @stats_create.
|
||||
|
||||
<program_id>
|
||||
|
@ -127,7 +130,6 @@ Messages
|
|||
if they were specified when creating the region.
|
||||
|
||||
@stats_print <region_id> [<starting_line> <number_of_lines>]
|
||||
|
||||
Print counters for each step-sized area of a region.
|
||||
|
||||
<region_id>
|
||||
|
@ -143,10 +145,11 @@ Messages
|
|||
|
||||
Output format for each step-sized area of a region:
|
||||
|
||||
<start_sector>+<length> counters
|
||||
<start_sector>+<length>
|
||||
counters
|
||||
|
||||
The first 11 counters have the same meaning as
|
||||
/sys/block/*/stat or /proc/diskstats.
|
||||
`/sys/block/*/stat or /proc/diskstats`.
|
||||
|
||||
Please refer to Documentation/iostats.txt for details.
|
||||
|
||||
|
@ -163,11 +166,11 @@ Messages
|
|||
11. the weighted number of milliseconds spent doing I/Os
|
||||
|
||||
Additional counters:
|
||||
|
||||
12. the total time spent reading in milliseconds
|
||||
13. the total time spent writing in milliseconds
|
||||
|
||||
@stats_print_clear <region_id> [<starting_line> <number_of_lines>]
|
||||
|
||||
Atomically print and then clear all the counters except the
|
||||
in-flight i/o counters. Useful when the client consuming the
|
||||
statistics does not want to lose any statistics (those updated
|
||||
|
@ -185,7 +188,6 @@ Messages
|
|||
If omitted, all lines are printed and then cleared.
|
||||
|
||||
@stats_set_aux <region_id> <aux_data>
|
||||
|
||||
Store auxiliary data aux_data for the specified region.
|
||||
|
||||
<region_id>
|
||||
|
@ -201,23 +203,23 @@ Examples
|
|||
========
|
||||
|
||||
Subdivide the DM device 'vol' into 100 pieces and start collecting
|
||||
statistics on them:
|
||||
statistics on them::
|
||||
|
||||
dmsetup message vol 0 @stats_create - /100
|
||||
|
||||
Set the auxiliary data string to "foo bar baz" (the escape for each
|
||||
space must also be escaped, otherwise the shell will consume them):
|
||||
space must also be escaped, otherwise the shell will consume them)::
|
||||
|
||||
dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
|
||||
|
||||
List the statistics:
|
||||
List the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_list
|
||||
|
||||
Print the statistics:
|
||||
Print the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_print 0
|
||||
|
||||
Delete the statistics:
|
||||
Delete the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_delete 0
|
|
@ -0,0 +1,61 @@
|
|||
=========
|
||||
dm-stripe
|
||||
=========
|
||||
|
||||
Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
|
||||
device across one or more underlying devices. Data is written in "chunks",
|
||||
with consecutive chunks rotating among the underlying devices. This can
|
||||
potentially provide improved I/O throughput by utilizing several physical
|
||||
devices in parallel.
|
||||
|
||||
Parameters: <num devs> <chunk size> [<dev path> <offset>]+
|
||||
<num devs>:
|
||||
Number of underlying devices.
|
||||
<chunk size>:
|
||||
Size of each chunk of data. Must be at least as
|
||||
large as the system's PAGE_SIZE.
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
|
||||
One or more underlying devices can be specified. The striped device size must
|
||||
be a multiple of the chunk size multiplied by the number of underlying devices.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
#!/usr/bin/perl -w
|
||||
# Create a striped device across any number of underlying devices. The device
|
||||
# will be called "stripe_dev" and have a chunk-size of 128k.
|
||||
|
||||
my $chunk_size = 128 * 2;
|
||||
my $dev_name = "stripe_dev";
|
||||
my $num_devs = @ARGV;
|
||||
my @devs = @ARGV;
|
||||
my ($min_dev_size, $stripe_dev_size, $i);
|
||||
|
||||
if (!$num_devs) {
|
||||
die("Specify at least one device\n");
|
||||
}
|
||||
|
||||
$min_dev_size = `blockdev --getsz $devs[0]`;
|
||||
for ($i = 1; $i < $num_devs; $i++) {
|
||||
my $this_size = `blockdev --getsz $devs[$i]`;
|
||||
$min_dev_size = ($min_dev_size < $this_size) ?
|
||||
$min_dev_size : $this_size;
|
||||
}
|
||||
|
||||
$stripe_dev_size = $min_dev_size * $num_devs;
|
||||
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
|
||||
|
||||
$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
|
||||
for ($i = 0; $i < $num_devs; $i++) {
|
||||
$table .= " $devs[$i] 0";
|
||||
}
|
||||
|
||||
`echo $table | dmsetup create $dev_name`;
|
|
@ -1,57 +0,0 @@
|
|||
dm-stripe
|
||||
=========
|
||||
|
||||
Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
|
||||
device across one or more underlying devices. Data is written in "chunks",
|
||||
with consecutive chunks rotating among the underlying devices. This can
|
||||
potentially provide improved I/O throughput by utilizing several physical
|
||||
devices in parallel.
|
||||
|
||||
Parameters: <num devs> <chunk size> [<dev path> <offset>]+
|
||||
<num devs>: Number of underlying devices.
|
||||
<chunk size>: Size of each chunk of data. Must be at least as
|
||||
large as the system's PAGE_SIZE.
|
||||
<dev path>: Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>: Starting sector within the device.
|
||||
|
||||
One or more underlying devices can be specified. The striped device size must
|
||||
be a multiple of the chunk size multiplied by the number of underlying devices.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
[[
|
||||
#!/usr/bin/perl -w
|
||||
# Create a striped device across any number of underlying devices. The device
|
||||
# will be called "stripe_dev" and have a chunk-size of 128k.
|
||||
|
||||
my $chunk_size = 128 * 2;
|
||||
my $dev_name = "stripe_dev";
|
||||
my $num_devs = @ARGV;
|
||||
my @devs = @ARGV;
|
||||
my ($min_dev_size, $stripe_dev_size, $i);
|
||||
|
||||
if (!$num_devs) {
|
||||
die("Specify at least one device\n");
|
||||
}
|
||||
|
||||
$min_dev_size = `blockdev --getsz $devs[0]`;
|
||||
for ($i = 1; $i < $num_devs; $i++) {
|
||||
my $this_size = `blockdev --getsz $devs[$i]`;
|
||||
$min_dev_size = ($min_dev_size < $this_size) ?
|
||||
$min_dev_size : $this_size;
|
||||
}
|
||||
|
||||
$stripe_dev_size = $min_dev_size * $num_devs;
|
||||
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
|
||||
|
||||
$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
|
||||
for ($i = 0; $i < $num_devs; $i++) {
|
||||
$table .= " $devs[$i] 0";
|
||||
}
|
||||
|
||||
`echo $table | dmsetup create $dev_name`;
|
||||
]]
|
||||
|
|
@ -1,3 +1,4 @@
|
|||
=========
|
||||
dm-switch
|
||||
=========
|
||||
|
||||
|
@ -67,27 +68,25 @@ b-tree can achieve.
|
|||
Construction Parameters
|
||||
=======================
|
||||
|
||||
<num_paths> <region_size> <num_optional_args> [<optional_args>...]
|
||||
[<dev_path> <offset>]+
|
||||
<num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
|
||||
<num_paths>
|
||||
The number of paths across which to distribute the I/O.
|
||||
|
||||
<num_paths>
|
||||
The number of paths across which to distribute the I/O.
|
||||
<region_size>
|
||||
The number of 512-byte sectors in a region. Each region can be redirected
|
||||
to any of the available paths.
|
||||
|
||||
<region_size>
|
||||
The number of 512-byte sectors in a region. Each region can be redirected
|
||||
to any of the available paths.
|
||||
<num_optional_args>
|
||||
The number of optional arguments. Currently, no optional arguments
|
||||
are supported and so this must be zero.
|
||||
|
||||
<num_optional_args>
|
||||
The number of optional arguments. Currently, no optional arguments
|
||||
are supported and so this must be zero.
|
||||
<dev_path>
|
||||
The block device that represents a specific path to the device.
|
||||
|
||||
<dev_path>
|
||||
The block device that represents a specific path to the device.
|
||||
|
||||
<offset>
|
||||
The offset of the start of data on the specific <dev_path> (in units
|
||||
of 512-byte sectors). This number is added to the sector number when
|
||||
forwarding the request to the specific path. Typically it is zero.
|
||||
<offset>
|
||||
The offset of the start of data on the specific <dev_path> (in units
|
||||
of 512-byte sectors). This number is added to the sector number when
|
||||
forwarding the request to the specific path. Typically it is zero.
|
||||
|
||||
Messages
|
||||
========
|
||||
|
@ -122,17 +121,21 @@ Example
|
|||
Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
|
||||
the same size.
|
||||
|
||||
Create a switch device with 64kB region size:
|
||||
Create a switch device with 64kB region size::
|
||||
|
||||
dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
|
||||
switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
|
||||
|
||||
Set mappings for the first 7 entries to point to devices switch0, switch1,
|
||||
switch2, switch0, switch1, switch2, switch1:
|
||||
switch2, switch0, switch1, switch2, switch1::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
|
||||
|
||||
Set repetitive mapping. This command:
|
||||
Set repetitive mapping. This command::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
|
||||
is equivalent to:
|
||||
|
||||
is equivalent to::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
|
||||
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
|
||||
|
|
@ -1,3 +1,7 @@
|
|||
=================
|
||||
Thin provisioning
|
||||
=================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
|
@ -95,6 +99,8 @@ previously.)
|
|||
Using an existing pool device
|
||||
-----------------------------
|
||||
|
||||
::
|
||||
|
||||
dmsetup create pool \
|
||||
--table "0 20971520 thin-pool $metadata_dev $data_dev \
|
||||
$data_block_size $low_water_mark"
|
||||
|
@ -154,7 +160,7 @@ Thin provisioning
|
|||
i) Creating a new thinly-provisioned volume.
|
||||
|
||||
To create a new thinly- provisioned volume you must send a message to an
|
||||
active pool device, /dev/mapper/pool in this example.
|
||||
active pool device, /dev/mapper/pool in this example::
|
||||
|
||||
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||||
|
||||
|
@ -164,7 +170,7 @@ i) Creating a new thinly-provisioned volume.
|
|||
|
||||
ii) Using a thinly-provisioned volume.
|
||||
|
||||
Thinly-provisioned volumes are activated using the 'thin' target:
|
||||
Thinly-provisioned volumes are activated using the 'thin' target::
|
||||
|
||||
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
|
||||
|
||||
|
@ -181,6 +187,8 @@ i) Creating an internal snapshot.
|
|||
must suspend it before creating the snapshot to avoid corruption.
|
||||
This is NOT enforced at the moment, so please be careful!
|
||||
|
||||
::
|
||||
|
||||
dmsetup suspend /dev/mapper/thin
|
||||
dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
|
||||
dmsetup resume /dev/mapper/thin
|
||||
|
@ -198,14 +206,14 @@ ii) Using an internal snapshot.
|
|||
activating or removing them both. (This differs from conventional
|
||||
device-mapper snapshots.)
|
||||
|
||||
Activate it exactly the same way as any other thinly-provisioned volume:
|
||||
Activate it exactly the same way as any other thinly-provisioned volume::
|
||||
|
||||
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
|
||||
|
||||
External snapshots
|
||||
------------------
|
||||
|
||||
You can use an external _read only_ device as an origin for a
|
||||
You can use an external **read only** device as an origin for a
|
||||
thinly-provisioned volume. Any read to an unprovisioned area of the
|
||||
thin device will be passed through to the origin. Writes trigger
|
||||
the allocation of new blocks as usual.
|
||||
|
@ -223,11 +231,13 @@ i) Creating a snapshot of an external device
|
|||
This is the same as creating a thin device.
|
||||
You don't mention the origin at this stage.
|
||||
|
||||
::
|
||||
|
||||
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||||
|
||||
ii) Using a snapshot of an external device.
|
||||
|
||||
Append an extra parameter to the thin target specifying the origin:
|
||||
Append an extra parameter to the thin target specifying the origin::
|
||||
|
||||
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
|
||||
|
||||
|
@ -240,6 +250,8 @@ Deactivation
|
|||
All devices using a pool must be deactivated before the pool itself
|
||||
can be.
|
||||
|
||||
::
|
||||
|
||||
dmsetup remove thin
|
||||
dmsetup remove snap
|
||||
dmsetup remove pool
|
||||
|
@ -252,25 +264,32 @@ Reference
|
|||
|
||||
i) Constructor
|
||||
|
||||
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
|
||||
<low water mark (blocks)> [<number of feature args> [<arg>]*]
|
||||
::
|
||||
|
||||
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
|
||||
<low water mark (blocks)> [<number of feature args> [<arg>]*]
|
||||
|
||||
Optional feature arguments:
|
||||
|
||||
skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.
|
||||
skip_block_zeroing:
|
||||
Skip the zeroing of newly-provisioned blocks.
|
||||
|
||||
ignore_discard: Disable discard support.
|
||||
ignore_discard:
|
||||
Disable discard support.
|
||||
|
||||
no_discard_passdown: Don't pass discards down to the underlying
|
||||
data device, but just remove the mapping.
|
||||
no_discard_passdown:
|
||||
Don't pass discards down to the underlying
|
||||
data device, but just remove the mapping.
|
||||
|
||||
read_only: Don't allow any changes to be made to the pool
|
||||
read_only:
|
||||
Don't allow any changes to be made to the pool
|
||||
metadata. This mode is only available after the
|
||||
thin-pool has been created and first used in full
|
||||
read/write mode. It cannot be specified on initial
|
||||
thin-pool creation.
|
||||
|
||||
error_if_no_space: Error IOs, instead of queueing, if no space.
|
||||
error_if_no_space:
|
||||
Error IOs, instead of queueing, if no space.
|
||||
|
||||
Data block size must be between 64KB (128 sectors) and 1GB
|
||||
(2097152 sectors) inclusive.
|
||||
|
@ -278,10 +297,12 @@ i) Constructor
|
|||
|
||||
ii) Status
|
||||
|
||||
<transaction id> <used metadata blocks>/<total metadata blocks>
|
||||
<used data blocks>/<total data blocks> <held metadata root>
|
||||
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
|
||||
needs_check|- metadata_low_watermark
|
||||
::
|
||||
|
||||
<transaction id> <used metadata blocks>/<total metadata blocks>
|
||||
<used data blocks>/<total data blocks> <held metadata root>
|
||||
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
|
||||
needs_check|- metadata_low_watermark
|
||||
|
||||
transaction id:
|
||||
A 64-bit number used by userspace to help synchronise with metadata
|
||||
|
@ -336,13 +357,11 @@ ii) Status
|
|||
iii) Messages
|
||||
|
||||
create_thin <dev id>
|
||||
|
||||
Create a new thinly-provisioned device.
|
||||
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||||
the caller.
|
||||
|
||||
create_snap <dev id> <origin id>
|
||||
|
||||
Create a new snapshot of another thinly-provisioned device.
|
||||
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||||
the caller.
|
||||
|
@ -350,11 +369,9 @@ iii) Messages
|
|||
of which the new device will be a snapshot.
|
||||
|
||||
delete <dev id>
|
||||
|
||||
Deletes a thin device. Irreversible.
|
||||
|
||||
set_transaction_id <current id> <new id>
|
||||
|
||||
Userland volume managers, such as LVM, need a way to
|
||||
synchronise their external metadata with the internal metadata of the
|
||||
pool target. The thin-pool target offers to store an
|
||||
|
@ -364,14 +381,12 @@ iii) Messages
|
|||
compare-and-swap message.
|
||||
|
||||
reserve_metadata_snap
|
||||
|
||||
Reserve a copy of the data mapping btree for use by userland.
|
||||
This allows userland to inspect the mappings as they were when
|
||||
this message was executed. Use the pool's status command to
|
||||
get the root block associated with the metadata snapshot.
|
||||
|
||||
release_metadata_snap
|
||||
|
||||
Release a previously reserved copy of the data mapping btree.
|
||||
|
||||
'thin' target
|
||||
|
@ -379,7 +394,9 @@ iii) Messages
|
|||
|
||||
i) Constructor
|
||||
|
||||
thin <pool dev> <dev id> [<external origin dev>]
|
||||
::
|
||||
|
||||
thin <pool dev> <dev id> [<external origin dev>]
|
||||
|
||||
pool dev:
|
||||
the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
|
||||
|
@ -401,8 +418,7 @@ provisioned as and when needed.
|
|||
|
||||
ii) Status
|
||||
|
||||
<nr mapped sectors> <highest mapped sector>
|
||||
|
||||
<nr mapped sectors> <highest mapped sector>
|
||||
If the pool has encountered device errors and failed, the status
|
||||
will just contain the string 'Fail'. The userspace recovery
|
||||
tools should then be used.
|
|
@ -1,3 +1,7 @@
|
|||
================================
|
||||
Device-mapper "unstriped" target
|
||||
================================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
|
@ -34,46 +38,46 @@ striped target to combine the 4 devices into one. It then will use
|
|||
the unstriped target ontop of the striped device to access the
|
||||
individual backing loop devices. We write data to the newly exposed
|
||||
unstriped devices and verify the data written matches the correct
|
||||
underlying device on the striped array.
|
||||
underlying device on the striped array::
|
||||
|
||||
#!/bin/bash
|
||||
#!/bin/bash
|
||||
|
||||
MEMBER_SIZE=$((128 * 1024 * 1024))
|
||||
NUM=4
|
||||
SEQ_END=$((${NUM}-1))
|
||||
CHUNK=256
|
||||
BS=4096
|
||||
MEMBER_SIZE=$((128 * 1024 * 1024))
|
||||
NUM=4
|
||||
SEQ_END=$((${NUM}-1))
|
||||
CHUNK=256
|
||||
BS=4096
|
||||
|
||||
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
|
||||
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
|
||||
COUNT=$((${MEMBER_SIZE} / ${BS}))
|
||||
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
|
||||
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
|
||||
COUNT=$((${MEMBER_SIZE} / ${BS}))
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
|
||||
losetup /dev/loop${i} member-${i}
|
||||
DM_PARMS+=" /dev/loop${i} 0"
|
||||
done
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
|
||||
losetup /dev/loop${i} member-${i}
|
||||
DM_PARMS+=" /dev/loop${i} 0"
|
||||
done
|
||||
|
||||
echo $DM_PARMS | dmsetup create raid0
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
|
||||
done;
|
||||
echo $DM_PARMS | dmsetup create raid0
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
|
||||
done;
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
|
||||
diff /dev/mapper/set-${i} member-${i}
|
||||
done;
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
|
||||
diff /dev/mapper/set-${i} member-${i}
|
||||
done;
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dmsetup remove set-${i}
|
||||
done
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dmsetup remove set-${i}
|
||||
done
|
||||
|
||||
dmsetup remove raid0
|
||||
dmsetup remove raid0
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
losetup -d /dev/loop${i}
|
||||
rm -f member-${i}
|
||||
done
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
losetup -d /dev/loop${i}
|
||||
rm -f member-${i}
|
||||
done
|
||||
|
||||
Another example
|
||||
---------------
|
||||
|
@ -81,7 +85,7 @@ Another example
|
|||
Intel NVMe drives contain two cores on the physical device.
|
||||
Each core of the drive has segregated access to its LBA range.
|
||||
The current LBA model has a RAID 0 128k chunk on each core, resulting
|
||||
in a 256k stripe across the two cores:
|
||||
in a 256k stripe across the two cores::
|
||||
|
||||
Core 0: Core 1:
|
||||
__________ __________
|
||||
|
@ -108,17 +112,24 @@ Example dmsetup usage
|
|||
|
||||
unstriped ontop of Intel NVMe device that has 2 cores
|
||||
-----------------------------------------------------
|
||||
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
|
||||
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
|
||||
|
||||
::
|
||||
|
||||
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
|
||||
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
|
||||
|
||||
There will now be two devices that expose Intel NVMe core 0 and 1
|
||||
respectively:
|
||||
/dev/mapper/nvmset0
|
||||
/dev/mapper/nvmset1
|
||||
respectively::
|
||||
|
||||
/dev/mapper/nvmset0
|
||||
/dev/mapper/nvmset1
|
||||
|
||||
unstriped ontop of striped with 4 drives using 128K chunk size
|
||||
--------------------------------------------------------------
|
||||
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
|
||||
|
||||
::
|
||||
|
||||
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
|
|
@ -1,5 +1,6 @@
|
|||
=========
|
||||
dm-verity
|
||||
==========
|
||||
=========
|
||||
|
||||
Device-Mapper's "verity" target provides transparent integrity checking of
|
||||
block devices using a cryptographic digest provided by the kernel crypto API.
|
||||
|
@ -7,6 +8,9 @@ This target is read-only.
|
|||
|
||||
Construction Parameters
|
||||
=======================
|
||||
|
||||
::
|
||||
|
||||
<version> <dev> <hash_dev>
|
||||
<data_block_size> <hash_block_size>
|
||||
<num_data_blocks> <hash_start_block>
|
||||
|
@ -160,7 +164,9 @@ calculating the parent node.
|
|||
|
||||
The tree looks something like:
|
||||
|
||||
alg = sha256, num_blocks = 32768, block_size = 4096
|
||||
alg = sha256, num_blocks = 32768, block_size = 4096
|
||||
|
||||
::
|
||||
|
||||
[ root ]
|
||||
/ . . . \
|
||||
|
@ -189,6 +195,7 @@ block boundary) are the hash blocks which are stored a depth at a time
|
|||
|
||||
The full specification of kernel parameters and on-disk metadata format
|
||||
is available at the cryptsetup project's wiki page
|
||||
|
||||
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
|
||||
|
||||
Status
|
||||
|
@ -198,7 +205,8 @@ If any check failed, C (for Corruption) is returned.
|
|||
|
||||
Example
|
||||
=======
|
||||
Set up a device:
|
||||
Set up a device::
|
||||
|
||||
# dmsetup create vroot --readonly --table \
|
||||
"0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
|
||||
"4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
|
||||
|
@ -209,11 +217,13 @@ the hash tree or activate the kernel device. This is available from
|
|||
the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
|
||||
(as a libcryptsetup extension).
|
||||
|
||||
Create hash on the device:
|
||||
Create hash on the device::
|
||||
|
||||
# veritysetup format /dev/sda1 /dev/sda2
|
||||
...
|
||||
Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
|
||||
|
||||
Activate the device:
|
||||
Activate the device::
|
||||
|
||||
# veritysetup create vroot /dev/sda1 /dev/sda2 \
|
||||
4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
|
|
@ -1,3 +1,7 @@
|
|||
=================
|
||||
Writecache target
|
||||
=================
|
||||
|
||||
The writecache target caches writes on persistent memory or on SSD. It
|
||||
doesn't cache reads because reads are supposed to be cached in page cache
|
||||
in normal RAM.
|
||||
|
@ -6,15 +10,18 @@ When the device is constructed, the first sector should be zeroed or the
|
|||
first sector should contain valid superblock from previous invocation.
|
||||
|
||||
Constructor parameters:
|
||||
|
||||
1. type of the cache device - "p" or "s"
|
||||
p - persistent memory
|
||||
s - SSD
|
||||
|
||||
- p - persistent memory
|
||||
- s - SSD
|
||||
2. the underlying device that will be cached
|
||||
3. the cache device
|
||||
4. block size (4096 is recommended; the maximum block size is the page
|
||||
size)
|
||||
5. the number of optional parameters (the parameters with an argument
|
||||
count as two)
|
||||
|
||||
start_sector n (default: 0)
|
||||
offset from the start of cache device in 512-byte sectors
|
||||
high_watermark n (default: 50)
|
||||
|
@ -43,6 +50,7 @@ Constructor parameters:
|
|||
applicable only to persistent memory - don't use the FUA
|
||||
flag when writing back data and send the FLUSH request
|
||||
afterwards
|
||||
|
||||
- some underlying devices perform better with fua, some
|
||||
with nofua. The user should test it
|
||||
|
||||
|
@ -60,6 +68,7 @@ Messages:
|
|||
flush the cache device on next suspend. Use this message
|
||||
when you are going to remove the cache device. The proper
|
||||
sequence for removing the cache device is:
|
||||
|
||||
1. send the "flush_on_suspend" message
|
||||
2. load an inactive table with a linear target that maps
|
||||
to the underlying device
|
|
@ -1,3 +1,4 @@
|
|||
=======
|
||||
dm-zero
|
||||
=======
|
||||
|
||||
|
@ -18,20 +19,19 @@ filesystem limitations.
|
|||
|
||||
To create a sparse device, start by creating a dm-zero device that's the
|
||||
desired size of the sparse device. For this example, we'll assume a 10TB
|
||||
sparse device.
|
||||
sparse device::
|
||||
|
||||
TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
|
||||
echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
|
||||
TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
|
||||
echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
|
||||
|
||||
Then create a snapshot of the zero device, using any available block-device as
|
||||
the COW device. The size of the COW device will determine the amount of real
|
||||
space available to the sparse device. For this example, we'll assume /dev/sdb1
|
||||
is an available 10GB partition.
|
||||
is an available 10GB partition::
|
||||
|
||||
echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
|
||||
dmsetup create sparse1
|
||||
echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
|
||||
dmsetup create sparse1
|
||||
|
||||
This will create a 10TB sparse device called /dev/mapper/sparse1 that has
|
||||
10GB of actual storage space available. If more than 10GB of data is written
|
||||
to this device, it will start returning I/O errors.
|
||||
|
|
@ -417,9 +417,9 @@ will then have to be provided beforehand in the normal way.
|
|||
|
||||
[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
|
||||
|
||||
[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt
|
||||
[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst
|
||||
|
||||
[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt
|
||||
[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.rst
|
||||
|
||||
[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html
|
||||
|
||||
|
|
|
@ -453,7 +453,7 @@ config DM_INIT
|
|||
Enable "dm-mod.create=" parameter to create mapped devices at init time.
|
||||
This option is useful to allow mounting rootfs without requiring an
|
||||
initramfs.
|
||||
See Documentation/device-mapper/dm-init.txt for dm-mod.create="..."
|
||||
See Documentation/device-mapper/dm-init.rst for dm-mod.create="..."
|
||||
format.
|
||||
|
||||
If unsure, say N.
|
||||
|
|
|
@ -25,7 +25,7 @@ static char *create;
|
|||
* Format: dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
|
||||
* Table format: <start_sector> <num_sectors> <target_type> <target_args>
|
||||
*
|
||||
* See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." format
|
||||
* See Documentation/device-mapper/dm-init.rst for dm-mod.create="..." format
|
||||
* details.
|
||||
*/
|
||||
|
||||
|
|
|
@ -3558,7 +3558,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
|
|||
* v1.5.0+:
|
||||
*
|
||||
* Sync action:
|
||||
* See Documentation/device-mapper/dm-raid.txt for
|
||||
* See Documentation/device-mapper/dm-raid.rst for
|
||||
* information on each of these states.
|
||||
*/
|
||||
DMEMIT(" %s", sync_action);
|
||||
|
|
Loading…
Reference in New Issue