.. SPDX-License-Identifier: GPL-2.0-only

========
dm-clone
========

Introduction
============

dm-clone is a device mapper target which produces a one-to-one copy of an
existing, read-only source device into a writable destination device. It
presents a virtual block device which makes all data appear immediately, and
redirects reads and writes accordingly.

The main use case of dm-clone is to clone a potentially remote, high-latency,
read-only, archival-type block device into a writable, fast, primary-type device
for fast, low-latency I/O. The cloned device is visible/mountable immediately
and the copy of the source device to the destination device happens in the
background, in parallel with user I/O.

For example, one could restore an application backup from a read-only copy,
accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
etc.), into a local SSD or NVMe device, and start using the device immediately,
without waiting for the restore to complete.

When the cloning completes, the dm-clone table can be removed altogether and be
replaced, e.g., by a linear table, mapping directly to the destination device.

The dm-clone target reuses the metadata library used by the thin-provisioning
target.

Glossary
========

Hydration
  The process of filling a region of the destination device with data from
  the same region of the source device, i.e., copying the region from the
  source to the destination device.

Once a region gets hydrated we redirect all I/O regarding it to the destination
device.

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with other
parameters detailed later):

1. A source device - the read-only device that gets cloned and serves as the
   source of the hydration.

2. A destination device - the destination of the hydration, which will become a
   clone of the source device.

3. A small metadata device - it records which regions are already valid in the
   destination device, i.e., which regions have already been hydrated, or have
   been written to directly, via user I/O.

The size of the destination device must be at least equal to the size of the
source device.

Regions
-------

dm-clone divides the source and destination devices into fixed-size regions.
Regions are the unit of hydration, i.e., the minimum amount of data copied from
the source to the destination device.

The region size is configurable when you first create the dm-clone device. The
recommended region size is the same as the file system block size, which usually
is 4KB. The region size must be a power of two between 8 sectors (4KB) and
2097152 sectors (1GB).

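The constraints on the region size can be checked before a table is built. The
following is a minimal sketch in POSIX shell; `valid_region_size` is a
hypothetical helper written for this document, not part of dm-clone or dmsetup.

```shell
# Hypothetical helper: succeed if $1 is a usable dm-clone region size,
# expressed in 512-byte sectors.
valid_region_size() {
    size=$1
    # Reject anything that is not a plain decimal number without
    # leading zeros.
    case $size in
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    # Must lie between 8 sectors (4KB) and 2097152 sectors (1GB).
    [ "$size" -ge 8 ] 2>/dev/null && [ "$size" -le 2097152 ] || return 1
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    [ $(( size & (size - 1) )) -eq 0 ]
}
```

For instance, `valid_region_size 8` succeeds, while `valid_region_size 24`
fails because 24 is not a power of two.
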
Reads and writes from/to hydrated regions are serviced from the destination
device.

A read to a not-yet-hydrated region is serviced directly from the source device.

A write to a not-yet-hydrated region starts the hydration of that region
immediately and is delayed until the hydration has completed.

Note that a write request with size equal to the region size will skip copying
of the corresponding region from the source device and overwrite the region of
the destination device directly.

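The routing rules for a single region can be summarised as a small decision
function. This is a toy model in POSIX shell, not kernel code; `route_io`, its
arguments and the strings it prints are hypothetical names chosen for
illustration.

```shell
# Toy model of how dm-clone services a request to one region.
#   $1: "read" or "write"
#   $2: "yes" if the region is already hydrated, "no" otherwise
#   $3: "yes" if a write covers the whole region (ignored for reads)
route_io() {
    op=$1 hydrated=$2 full_region=${3:-no}
    if [ "$hydrated" = yes ]; then
        echo "destination"
    elif [ "$op" = read ]; then
        echo "source"
    elif [ "$full_region" = yes ]; then
        # The write replaces the whole region, so the copy is skipped.
        echo "destination (overwrite, no copy)"
    else
        echo "hydrate region, then destination"
    fi
}

route_io read no
route_io write no yes
```

The two calls at the end show a read that falls through to the source device
and a region-sized write that bypasses hydration.
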
Discards
--------

dm-clone interprets a discard request to a range that hasn't been hydrated yet
as a hint to skip hydration of the regions covered by the request, i.e., it
skips copying the regions' data from the source to the destination device, and
only updates its metadata.

If the destination device supports discards, then by default dm-clone will pass
down discard requests to it.

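To make "regions covered by the request" concrete, the sketch below computes
which regions a discard spans completely. It assumes that only regions lying
entirely inside the discarded range are marked as not needing hydration, which
this document does not spell out. `discard_regions` is a hypothetical helper
and all values are in 512-byte sectors.

```shell
# Print the first and last region fully covered by a discard, or
# "none" if the discard does not cover any whole region.
#   $1: start sector  $2: length in sectors  $3: region size in sectors
discard_regions() {
    start=$1 len=$2 rsize=$3
    first=$(( (start + rsize - 1) / rsize ))   # round the start up
    end=$(( (start + len) / rsize ))           # round the end down (exclusive)
    if [ "$first" -lt "$end" ]; then
        echo "$first $(( end - 1 ))"
    else
        echo "none"
    fi
}

discard_regions 0 16 8    # 0 1
discard_regions 4 16 8    # 1 1
discard_regions 4 8 8     # none
```

Under this assumption, a discard that is not aligned to region boundaries
leaves the partially covered regions at its edges to be hydrated as usual.
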
Background Hydration
--------------------

dm-clone copies continuously from the source to the destination device, until
all of the device has been copied.

Copying data from the source to the destination device uses bandwidth. The user
can set a throttle to prevent more than a certain amount of copying occurring at
any one time. Moreover, dm-clone takes into account user I/O traffic going to
the devices and pauses the background hydration when there is I/O in-flight.

A message `hydration_threshold <#regions>` can be used to set the maximum number
of regions being copied at any one time, the default being 1 region.

dm-clone employs dm-kcopyd for copying portions of the source device to the
destination device. By default, we issue copy requests of size equal to the
region size. A message `hydration_batch_size <#regions>` can be used to tune the
size of these copy requests. Increasing the hydration batch size results in
dm-clone trying to batch together contiguous regions, so we copy the data in
batches of this many regions.

When the hydration of the destination device finishes, a dm event will be sent
to user space.

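Both tunables are set at run time with `dmsetup message`. The following is a
minimal sketch: the wrapper name `clone_tune`, the `DMSETUP` override variable
and the values 256 and 16 are illustrative assumptions, not recommendations.

```shell
# Hypothetical wrapper: set a hydration tunable on a dm-clone device.
# DMSETUP can be overridden (e.g. DMSETUP=echo) for a dry run.
#   $1: device name  $2: tunable  $3: number of regions
clone_tune() {
    case $2 in
        hydration_threshold|hydration_batch_size) ;;
        *) echo "unknown tunable: $2" >&2; return 1 ;;
    esac
    # The sector argument is 0, as in the enable_hydration example
    # in this document.
    ${DMSETUP:-dmsetup} message "$1" 0 "$2" "$3"
}

# Dry run: print the dmsetup arguments instead of executing them.
DMSETUP=echo clone_tune clone hydration_threshold 256
DMSETUP=echo clone_tune clone hydration_batch_size 16
```

Without the `DMSETUP=echo` prefix the wrapper invokes the real `dmsetup`, which
requires root privileges and an existing dm-clone device named `clone`.
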
Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written. If no
such requests are made then commits will occur every second. This means the
dm-clone device behaves like a physical disk that has a volatile write cache. If
power is lost you may lose some recent writes. The metadata should always be
consistent in spite of any crash.

Target Interface
================

Constructor
-----------

::

   clone <metadata dev> <destination dev> <source dev> <region size>
         [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]]

================ ==============================================================
metadata dev     Fast device holding the persistent metadata
destination dev  The destination device, where the source will be cloned
source dev       Read only device containing the data that gets cloned
region size      The size of a region in sectors

#feature args    Number of feature arguments passed
feature args     no_hydration or no_discard_passdown

#core args       An even number of arguments corresponding to key/value pairs
                 passed to dm-clone
core args        Key/value pairs passed to dm-clone, e.g. `hydration_threshold
                 256`
================ ==============================================================

Optional feature arguments are:

==================== =========================================================
no_hydration         Create a dm-clone instance with background hydration
                     disabled
no_discard_passdown  Disable passing down discards to the destination device
==================== =========================================================

Optional core arguments are:

================================ ==============================================
hydration_threshold <#regions>   Maximum number of regions being copied from
                                 the source to the destination device at any
                                 one time, during background hydration.
hydration_batch_size <#regions>  During background hydration, try to batch
                                 together contiguous regions, so we copy data
                                 from the source to the destination device in
                                 batches of this many regions.
================================ ==============================================

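Since `<#feature args>` and `<#core args>` are counts of the words that follow
them, a table line is easy to get wrong by hand. The following is a minimal
sketch that assembles one. `clone_table` is a hypothetical helper, and its
calling convention (features first, then a literal `--`, then core key/value
pairs) as well as the fixed start sector of 0 are assumptions made for this
example only.

```shell
# Hypothetical helper: print a dm-clone table line starting at sector 0.
# Usage: clone_table <length> <metadata dev> <dest dev> <source dev> \
#                    <region size> [<feature>...] [-- <key> <value>...]
clone_table() {
    [ $# -ge 5 ] || { echo "too few arguments" >&2; return 1; }
    line="0 $1 clone $2 $3 $4 $5"
    shift 5
    features="" nfeatures=0
    while [ $# -gt 0 ] && [ "$1" != "--" ]; do
        features="$features $1"
        nfeatures=$(( nfeatures + 1 ))
        shift
    done
    [ $# -gt 0 ] && shift            # drop the "--" separator
    # Core arguments are key/value pairs, so their count must be even.
    [ $(( $# % 2 )) -eq 0 ] || { echo "odd number of core args" >&2; return 1; }
    if [ $# -gt 0 ]; then
        # Core args can only follow a feature count, even if it is 0.
        line="$line $nfeatures$features $# $*"
    elif [ "$nfeatures" -gt 0 ]; then
        line="$line $nfeatures$features"
    fi
    echo "$line"
}

clone_table 1048576000 /dev/meta /dev/dest /dev/src 8 no_hydration \
    -- hydration_threshold 256
```

The device paths are placeholders. The printed line could be passed to
`dmsetup create <name> --table "..."`.
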
Status
------

::

   <metadata block size> <#used metadata blocks>/<#total metadata blocks>
   <region size> <#hydrated regions>/<#total regions> <#hydrating regions>
   <#feature args> <feature args>* <#core args> <core args>*
   <clone metadata mode>

======================= =======================================================
metadata block size     Fixed block size for each metadata block in sectors
#used metadata blocks   Number of metadata blocks used
#total metadata blocks  Total number of metadata blocks
region size             Configurable region size for the device in sectors
#hydrated regions       Number of regions that have finished hydrating
#total regions          Total number of regions to hydrate
#hydrating regions      Number of regions currently hydrating
#feature args           Number of feature arguments to follow
feature args            Feature arguments, e.g. `no_hydration`
#core args              Even number of core arguments to follow
core args               Key/value pairs for tuning the core, e.g.
                        `hydration_threshold 256`
clone metadata mode     ro if read-only, rw if read-write

                        In serious cases where even a read-only mode is deemed
                        unsafe no further I/O will be permitted and the status
                        will just contain the string 'Fail'. If the metadata
                        mode changes, a dm event will be sent to user space.
======================= =======================================================

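The `<#hydrated regions>/<#total regions>` field is enough to track progress.
The following is a minimal sketch that extracts it from a status line. It
assumes the line is passed exactly as laid out above, i.e. with the
`<start> <length> clone` prefix that `dmsetup status` prints already removed.
`hydration_percent` is a hypothetical helper and the sample line is made up.

```shell
# Hypothetical helper: print the percentage of hydrated regions,
# rounded down, from a dm-clone status line.
hydration_percent() {
    # Split the line into words; only the fourth one is needed here.
    set -- $1
    hydrated=${4%/*}     # text before the slash
    total=${4#*/}        # text after the slash
    # Fails for a status of 'Fail', which has no fourth field.
    [ "$total" -gt 0 ] 2>/dev/null || return 1
    echo $(( hydrated * 100 / total ))
}

# Made-up status line: 32768 of 131072 regions hydrated.
hydration_percent "8 24/1024 8 32768/131072 0 1 no_hydration 4 hydration_threshold 1 hydration_batch_size 1 rw"
```

A result of 100 means that every region has been hydrated.
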
Messages
--------

`disable_hydration`
      Disable the background hydration of the destination device.

`enable_hydration`
      Enable the background hydration of the destination device.

`hydration_threshold <#regions>`
      Set background hydration threshold.

`hydration_batch_size <#regions>`
      Set background hydration batch size.

Examples
========

Clone a device containing a file system
---------------------------------------

1. Create the dm-clone device.

   ::

      dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \
        $source_dev 8 1 no_hydration"

2. Mount the device and trim the file system. dm-clone interprets the discards
   sent by the file system and it will not hydrate the unused space.

   ::

      mount /dev/mapper/clone /mnt/cloned-fs
      fstrim /mnt/cloned-fs

3. Enable background hydration of the destination device.

   ::

      dmsetup message clone 0 enable_hydration

4. When the hydration finishes, we can replace the dm-clone table with a linear
   table.

   ::

      dmsetup suspend clone
      dmsetup load clone --table "0 1048576000 linear $dest_dev 0"
      dmsetup resume clone

   The metadata device is no longer needed and can be safely discarded or reused
   for other purposes.

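The length `1048576000` used in both tables is expressed in 512-byte sectors,
and the region size `8` is in the same unit. The arithmetic below shows what
these figures correspond to. This is a minimal sketch: `sectors_to_gib` and
`region_count` are hypothetical helpers, and rounding the region count up for
a partial last region is an assumption.

```shell
# Size in GiB of a length given in 512-byte sectors (rounded down).
sectors_to_gib() {
    echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}

# Number of regions needed to cover a device; a partial last region
# still counts as one (assumption).
#   $1: device length in sectors  $2: region size in sectors
region_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

sectors_to_gib 1048576000      # 500
region_count 1048576000 8      # 131072000
```

With a region size of 8 sectors, the 500 GiB device in this example is tracked
as 131072000 regions.
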
Known issues
============

1. We redirect reads to not-yet-hydrated regions to the source device. If
   reading the source device has high latency and the user repeatedly reads from
   the same regions, this behaviour could degrade performance. We should use
   these reads as hints to hydrate the relevant regions sooner. Currently, we
   rely on the page cache to cache these regions, so we hopefully don't end up
   reading them multiple times from the source device.

2. In-core resources, i.e., the bitmaps tracking which regions are hydrated,
   are not released after the hydration has finished. We should release them.

3. During background hydration, if we fail to read the source or write to the
   destination device, we print an error message, but the hydration process
   continues indefinitely, until it succeeds. We should stop the background
   hydration after a number of failures and emit a dm event for user space to
   notice.

Why not...?
===========

We explored the following alternatives before implementing dm-clone:

1. Use dm-cache with cache size equal to the source device and implement a new
   cloning policy:

   * The resulting cache device is not a one-to-one mirror of the source device
     and thus we cannot remove the cache device once cloning completes.

   * dm-cache writes to the source device, which violates our requirement that
     the source device must be treated as read-only.

   * Caching is semantically different from cloning.

2. Use dm-snapshot with a COW device equal to the source device:

   * dm-snapshot stores its metadata in the COW device, so the resulting device
     is not a one-to-one mirror of the source device.

   * No background copying mechanism.

   * dm-snapshot needs to commit its metadata whenever a pending exception
     completes, to ensure snapshot consistency. In the case of cloning, we don't
     need to be so strict and can rely on committing metadata every time a FLUSH
     or FUA bio is written, or periodically, like dm-thin and dm-cache do. This
     improves the performance significantly.

3. Use dm-mirror: The mirror target has a background copying/mirroring
   mechanism, but it writes to all mirrors, thus violating our requirement that
   the source device must be treated as read-only.

4. Use dm-thin's external snapshot functionality. This approach is the most
   promising among all alternatives, as the thinly-provisioned volume is a
   one-to-one mirror of the source device and handles reads and writes to
   un-provisioned/not-yet-cloned areas the same way as dm-clone does.

   Still:

   * There is no background copying mechanism, though one could be implemented.

   * Most importantly, we want to support arbitrary block devices as the
     destination of the cloning process and not restrict ourselves to
     thinly-provisioned volumes. Thin-provisioning has an inherent metadata
     overhead, for maintaining the thin volume mappings, which significantly
     degrades performance.

   Moreover, cloning a device shouldn't force the use of thin-provisioning. On
   the other hand, if we wish to use thin provisioning, we can just use a thin
   LV as dm-clone's destination device.