libnvdimm: documentation clarifications
A bunch of changes that I hope will help in understanding it better for first-time readers.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
parent 589e75d157
commit 8de5dff8ba
|
@ -62,6 +62,12 @@ DAX: File system extensions to bypass the page cache and block layer to
|
||||||
mmap persistent memory, from a PMEM block device, directly into a
|
mmap persistent memory, from a PMEM block device, directly into a
|
||||||
process address space.
|
process address space.
|
||||||
|
|
||||||
|
DSM: Device Specific Method: ACPI method to control specific
|
||||||
|
device - in this case the firmware.
|
||||||
|
|
||||||
|
DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
|
||||||
|
It defines a vendor-id, device-id, and interface format for a given DIMM.
|
||||||
|
|
||||||
BTT: Block Translation Table: Persistent memory is byte addressable.
|
BTT: Block Translation Table: Persistent memory is byte addressable.
|
||||||
Existing software may have an expectation that the power-fail-atomicity
|
Existing software may have an expectation that the power-fail-atomicity
|
||||||
of writes is at least one sector, 512 bytes. The BTT is an indirection
|
of writes is at least one sector, 512 bytes. The BTT is an indirection
|
||||||
|
@ -133,16 +139,16 @@ device driver:
|
||||||
registered, can be immediately attached to nd_pmem.
|
registered, can be immediately attached to nd_pmem.
|
||||||
|
|
||||||
2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
|
2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
|
||||||
defined apertures. A set of apertures will all access just one DIMM.
|
defined apertures. A set of apertures will access just one DIMM.
|
||||||
Multiple windows allow multiple concurrent accesses, much like
|
Multiple windows (apertures) allow multiple concurrent accesses, much like
|
||||||
tagged-command-queuing, and would likely be used by different threads or
|
tagged-command-queuing, and would likely be used by different threads or
|
||||||
different CPUs.
|
different CPUs.
|
||||||
|
|
||||||
The NFIT specification defines a standard format for a BLK-aperture, but
|
The NFIT specification defines a standard format for a BLK-aperture, but
|
||||||
the spec also allows for vendor specific layouts, and non-NFIT BLK
|
the spec also allows for vendor specific layouts, and non-NFIT BLK
|
||||||
implementations may other designs for BLK I/O. For this reason "nd_blk"
|
implementations may have other designs for BLK I/O. For this reason
|
||||||
calls back into platform-specific code to perform the I/O. One such
|
"nd_blk" calls back into platform-specific code to perform the I/O.
|
||||||
implementation is defined in the "Driver Writer's Guide" and "DSM
|
One such implementation is defined in the "Driver Writer's Guide" and "DSM
|
||||||
Interface Example".
|
Interface Example".
|
||||||
|
|
||||||
|
|
||||||
|
@ -152,7 +158,7 @@ Why BLK?
|
||||||
While PMEM provides direct byte-addressable CPU-load/store access to
|
While PMEM provides direct byte-addressable CPU-load/store access to
|
||||||
NVDIMM storage, it does not provide the best system RAS (recovery,
|
NVDIMM storage, it does not provide the best system RAS (recovery,
|
||||||
availability, and serviceability) model. An access to a corrupted
|
availability, and serviceability) model. An access to a corrupted
|
||||||
system-physical-address address causes a cpu exception while an access
|
system-physical-address address causes a CPU exception while an access
|
||||||
to a corrupted address through a BLK-aperture causes that block window
|
to a corrupted address through a BLK-aperture causes that block window
|
||||||
to raise an error status in a register. The latter is more aligned with
|
to raise an error status in a register. The latter is more aligned with
|
||||||
the standard error model that host-bus-adapter attached disks present.
|
the standard error model that host-bus-adapter attached disks present.
|
||||||
|
@ -162,7 +168,7 @@ data could be interleaved in an opaque hardware specific manner across
|
||||||
several DIMMs.
|
several DIMMs.
|
||||||
|
|
||||||
PMEM vs BLK
|
PMEM vs BLK
|
||||||
BLK-apertures solve this RAS problem, but their presence is also the
|
BLK-apertures solve these RAS problems, but their presence is also the
|
||||||
major contributing factor to the complexity of the ND subsystem. They
|
major contributing factor to the complexity of the ND subsystem. They
|
||||||
complicate the implementation because PMEM and BLK alias in DPA space.
|
complicate the implementation because PMEM and BLK alias in DPA space.
|
||||||
Any given DIMM's DPA-range may contribute to one or more
|
Any given DIMM's DPA-range may contribute to one or more
|
||||||
|
@ -220,8 +226,8 @@ socket. Each unique interface (BLK or PMEM) to DPA space is identified
|
||||||
by a region device with a dynamically assigned id (REGION0 - REGION5).
|
by a region device with a dynamically assigned id (REGION0 - REGION5).
|
||||||
|
|
||||||
1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
|
1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
|
||||||
single PMEM namespace is created in the REGION0-SPA-range that spans
|
single PMEM namespace is created in the REGION0-SPA-range that spans most
|
||||||
DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
|
of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
|
||||||
interleaved system-physical-address range is reclaimed as BLK-aperture
|
interleaved system-physical-address range is reclaimed as BLK-aperture
|
||||||
accessed space starting at DPA-offset (a) into each DIMM. In that
|
accessed space starting at DPA-offset (a) into each DIMM. In that
|
||||||
reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
|
reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
|
||||||
|
@ -230,13 +236,13 @@ by a region device with a dynamically assigned id (REGION0 - REGION5).
|
||||||
|
|
||||||
2. In the last portion of DIMM0 and DIMM1 we have an interleaved
|
2. In the last portion of DIMM0 and DIMM1 we have an interleaved
|
||||||
system-physical-address range, REGION1, that spans those two DIMMs as
|
system-physical-address range, REGION1, that spans those two DIMMs as
|
||||||
well as DIMM2 and DIMM3. Some of REGION1 allocated to a PMEM namespace
|
well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
|
||||||
named "pm1.0" the rest is reclaimed in 4 BLK-aperture namespaces (for
|
named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
|
||||||
each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
|
each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
|
||||||
"blk5.0".
|
"blk5.0".
|
||||||
|
|
||||||
3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
|
3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
|
||||||
interleaved system-physical-address range (i.e. the DPA address below
|
interleaved system-physical-address range (i.e. the DPA address past
|
||||||
offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
|
offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
|
||||||
Note, that this example shows that BLK-aperture namespaces don't need to
|
Note, that this example shows that BLK-aperture namespaces don't need to
|
||||||
be contiguous in DPA-space.
|
be contiguous in DPA-space.
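
The point that a BLK-aperture namespace need not be contiguous in DPA-space can be illustrated with a minimal sketch. Everything named here (struct dpa_extent, ns_offset_to_dpa()) is hypothetical and not part of LIBNVDIMM or LIBNDCTL; the sketch only assumes that a namespace is described by an ordered list of DPA extents and that a namespace-relative offset is resolved by walking them in order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one contiguous run of DIMM-physical-address (DPA) space. */
struct dpa_extent {
	uint64_t start;	/* DPA of the first byte of the extent */
	uint64_t size;	/* length of the extent in bytes */
};

/*
 * Hypothetical helper: translate a namespace-relative byte offset into a
 * DPA by walking the extents in order.  Returns 0 and stores the result
 * in *dpa on success, or -1 if the offset lies beyond the last extent.
 */
static int ns_offset_to_dpa(const struct dpa_extent *ext, size_t n,
			    uint64_t offset, uint64_t *dpa)
{
	for (size_t i = 0; i < n; i++) {
		if (offset < ext[i].size) {
			*dpa = ext[i].start + offset;
			return 0;
		}
		/* Skip this extent and continue in the next one. */
		offset -= ext[i].size;
	}
	return -1;
}
```

With two extents, for instance one below offset (a) and one past offset (b), consecutive namespace offsets simply cross from the end of the first extent to the start of the second, even though the two DPA ranges are far apart.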
|
||||||
|
@ -252,15 +258,15 @@ LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
|
||||||
|
|
||||||
What follows is a description of the LIBNVDIMM sysfs layout and a
|
What follows is a description of the LIBNVDIMM sysfs layout and a
|
||||||
corresponding object hierarchy diagram as viewed through the LIBNDCTL
|
corresponding object hierarchy diagram as viewed through the LIBNDCTL
|
||||||
api. The example sysfs paths and diagrams are relative to the Example
|
API. The example sysfs paths and diagrams are relative to the Example
|
||||||
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
|
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
|
||||||
test.
|
test.
|
||||||
|
|
||||||
LIBNDCTL: Context
|
LIBNDCTL: Context
|
||||||
Every api call in the LIBNDCTL library requires a context that holds the
|
Every API call in the LIBNDCTL library requires a context that holds the
|
||||||
logging parameters and other library instance state. The library is
|
logging parameters and other library instance state. The library is
|
||||||
based on the libabc template:
|
based on the libabc template:
|
||||||
https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
|
https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
|
||||||
|
|
||||||
LIBNDCTL: instantiate a new library context example
|
LIBNDCTL: instantiate a new library context example
|
||||||
|
|
||||||
|
@ -409,7 +415,7 @@ Bit 31:28 Reserved
|
||||||
LIBNVDIMM/LIBNDCTL: Region
|
LIBNVDIMM/LIBNDCTL: Region
|
||||||
----------------------
|
----------------------
|
||||||
|
|
||||||
A generic REGION device is registered for each PMEM range orBLK-aperture
|
A generic REGION device is registered for each PMEM range or BLK-aperture
|
||||||
set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
|
set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
|
||||||
sets on the "nfit_test.0" bus. The primary role of regions is to be a
|
sets on the "nfit_test.0" bus. The primary role of regions is to be a
|
||||||
container of "mappings". A mapping is a tuple of <DIMM,
|
container of "mappings". A mapping is a tuple of <DIMM,
|
||||||
|
@ -509,7 +515,7 @@ At first glance it seems since NFIT defines just PMEM and BLK interface
|
||||||
types that we should simply name REGION devices with something derived
|
types that we should simply name REGION devices with something derived
|
||||||
from those type names. However, the ND subsystem explicitly keeps the
|
from those type names. However, the ND subsystem explicitly keeps the
|
||||||
REGION name generic and expects userspace to always consider the
|
REGION name generic and expects userspace to always consider the
|
||||||
region-attributes for 4 reasons:
|
region-attributes for four reasons:
|
||||||
|
|
||||||
1. There are already more than two REGION and "namespace" types. For
|
1. There are already more than two REGION and "namespace" types. For
|
||||||
PMEM there are two subtypes. As mentioned previously we have PMEM where
|
PMEM there are two subtypes. As mentioned previously we have PMEM where
|
||||||
|
@ -698,8 +704,8 @@ static int configure_namespace(struct ndctl_region *region,
|
||||||
|
|
||||||
Why the Term "namespace"?
|
Why the Term "namespace"?
|
||||||
|
|
||||||
1. Why not "volume" for instance? "volume" ran the risk of confusing ND
|
1. Why not "volume" for instance? "volume" ran the risk of confusing
|
||||||
as a volume manager like device-mapper.
|
ND (libnvdimm subsystem) with a volume manager like device-mapper.
|
||||||
|
|
||||||
2. The term originated to describe the sub-devices that can be created
|
2. The term originated to describe the sub-devices that can be created
|
||||||
within a NVME controller (see the nvme specification:
|
within a NVME controller (see the nvme specification:
|
||||||
|
@ -774,13 +780,14 @@ block" needs to be destroyed. Note, that to destroy a BTT the media
|
||||||
needs to be written in raw mode. By default, the kernel will autodetect
|
needs to be written in raw mode. By default, the kernel will autodetect
|
||||||
the presence of a BTT and disable raw mode. This autodetect behavior
|
the presence of a BTT and disable raw mode. This autodetect behavior
|
||||||
can be suppressed by enabling raw mode for the namespace via the
|
can be suppressed by enabling raw mode for the namespace via the
|
||||||
ndctl_namespace_set_raw_mode() api.
|
ndctl_namespace_set_raw_mode() API.
|
||||||
|
|
||||||
|
|
||||||
Summary LIBNDCTL Diagram
|
Summary LIBNDCTL Diagram
|
||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
|
For the given example above, here is the view of the objects as seen by the
|
||||||
|
LIBNDCTL API:
|
||||||
+---+
|
+---+
|
||||||
|CTX| +---------+ +--------------+ +---------------+
|
|CTX| +---------+ +--------------+ +---------------+
|
||||||
+-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
|
+-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
|
||||||
|
|