- Fix duplicate device_unregister() calls (multiple threads competing to
do unregister work when scheduling device removal from a sysfs attribute
of the self-same device).
- Fix badblocks registration order bug. Ensure region badblocks are
initialized in advance of namespace registration.
- Fix a deadlock between the bus lock and probe operations.
- Export device-core infrastructure to coordinate async operations via
the device ->dead state.
- Add device-core infrastructure to validate device_lock() usage with
lockdep.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdO39XAAoJEB7SkWpmfYgCzbEQAJigRJecrz+OyICGmIAeNSy5
hF6Cv+TPuccpnINNaULS7aJStv4Zl/3SxG5GkivKDk11Xs02VrLzv1m3nDxEOVwc
6LwRwcM7U3UtROzI5gjfT5StgBU4xvlQYKiYV5oxAXoQ5amApqbl3NgfH3qmCaXR
QqWhd7v7TiNZ1QWlnmRBw+j0YLbS1dHyaSAf4KZwnL6fVKmqxtfDxny5tG6jdDuq
olPue6nFAA+ebxyAsKR9VQVmcxDwuG0bJ/GUD6IeOQp/Eh6hcv2AfcVjp4Iwn/aM
n1dIXASFwKr6DoOXZgnUbfXMVGzq1qKHPNgzUvtK6SApZlcm+TnyIOfj0/6BNp9q
Bae1RMRwo5Wa5oAQed3CutvUUQAPa5WrW95E0/4T+dkcutkRnxL6akn/c87qQ4nL
F30zpL8U4UdeaJ5maEIqJ/mtAc9deHiFnO/k216+xvDcY3NGqvzY4PsUBAMep8i2
FgoaBr0hmTkb0KTMI858ChQrT+sjqwJIa854g7b4VxrQz93WYPABRK9ZhMSBEJ8b
rGCeNqvvq0G6dSN6e8bS6P/4EEk76nZAJUYKoMYmj3WuwYuY4Sxb86eFIudNeSEe
EqRGaefaZrqEL6LJTHScCk+55BgYSEOrDdip1lSWGdNHjvgZeIOZrgCrqrm/H72c
mkoCAzdA4drQ0D4ZbKrC
=mhIp
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A collection of locking and async operations fixes for v5.3-rc2. These
had been soaking in a branch targeting the merge window, but missed
due to a regression hunt. This fixed up version has otherwise been in
-next this past week with no reported issues.
In order to gain confidence in the locking changes the pull also
includes a debug / instrumentation patch to enable lockdep coverage
for libnvdimm subsystem operations that depend on the device_lock for
exclusion. As mentioned in the changelog it is a hack, but it works
and documents the locking expectations of the sub-system in a way that
others can use lockdep to verify. The driver core touches got an ack
from Greg.
Summary:
- Fix duplicate device_unregister() calls (multiple threads competing
to do unregister work when scheduling device removal from a sysfs
attribute of the self-same device).
- Fix badblocks registration order bug. Ensure region badblocks are
initialized in advance of namespace registration.
- Fix a deadlock between the bus lock and probe operations.
- Export device-core infrastructure to coordinate async operations
via the device ->dead state.
- Add device-core infrastructure to validate device_lock() usage with
lockdep"
* tag 'libnvdimm-fixes-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
driver-core, libnvdimm: Let device subsystems add local lockdep coverage
libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
libnvdimm/region: Register badblocks before namespaces
libnvdimm/bus: Prevent duplicate device_unregister() calls
drivers/base: Introduce kill_device()
For good reason, the standard device_lock() is marked
lockdep_set_novalidate_class() because there is simply no sane way to
describe the myriad ways the device_lock() ordered with other locks.
However, that leaves subsystems that know their own local device_lock()
ordering rules to find lock ordering mistakes manually. Instead,
introduce an optional / additional lockdep-enabled lock that a subsystem
can acquire in all the same paths that the device_lock() is acquired.
A conversion of the NFIT driver and NVDIMM subsystem to a
lockdep-validate device_lock() scheme is included. The
debug_nvdimm_lock() implementation implements the correct lock-class and
stacking order for the libnvdimm device topology hierarchy.
Yes, this is a hack, but hopefully it is a useful hack for other
subsystems device_lock() debug sessions. Quoting Greg:
"Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
using it as much as anything else, so user beware :)
I don't object to it if it makes things easier for you to debug."
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com
This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of version 2 of the gnu general public license as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 64 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With zero-key defined, we can remove previous detection of key id 0 or null
key in order to deal with a zero-key situation. Syncing all security
commands to use the zero-key. Helper functions are introduced to return the
data that points to the actual key payload or the zero_key. This helps
uniformly handle the key material even with zero_key.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The dynamic-debug statements for command payload output only get emitted
when the command is not ND_CMD_CALL. Move the output payload dumping
ahead of the early return path for ND_CMD_CALL.
Fixes: 31eca76ba2 ("...whitelisted dimm command marshaling mechanism")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
a compat driver so distributions can opt-in to the new ABI.
* Allow for an alternative driver for the device-dax address-range
* Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
* Arrange for the device-dax target-node to be onlined so that the newly
added memory range can be uniquely referenced by numa apis.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJchWpGAAoJEB7SkWpmfYgCJk8P/0Q1DINszUDO/vKjJ09cDs9P
Jw3it6GBIL50rDOu9QdcprSpwYDD0h1mLAV/m6oa3bVO+p4uWGvnxaxRx2HN2c/v
vhZFtUDpHlqR63vzWMNVKRprYixCRJDUr6xQhhCcE3ak/ELN6w7LWfikKVWv15UL
MfR96IQU38f+xRda/zSXnL9606Dvkvu/inEHj84lRcHIwj3sQAUalrE8bR3O32gZ
bDg/l5kzT49o8ZXUo/TegvRSSSZpJmOl2DD0RW+ax5q3NI2bOXFrVDUKBKxf/hcQ
E/V9i57TrqQx0GqRhnU7rN/v53cFZGGs31TEEIB/xs3bzCnADxwXcjL5b5K005J6
vJjBA2ODBewHFK3uVx46Hy1iV4eCtZWj4QrMnrjdSrjXOfbF5GTbWOhPFgoq7TWf
S7VqFEf3I2gDPaMq4o8Ej1kLH4HMYeor2NSOZjyvGn87rSZ3ZIQguwbaNIVl+itz
gdDt0ZOU0BgOBkV+rZIeZDaGdloWCHcDPL15CkZaOZyzdWhfEZ7dod6ad+9udilU
EUPH62RgzXZtfm5zpebYyjNVLbb9pLZ0nT+UypyGR6zqWx1SqU3mXi63NFXPco+x
XA9j//edPeI6NHg2CXLEh8DLuCg3dG1zWRJANkiF+niBwyCR8CHtGWAoY6soXbKe
2UrXGcIfXxyJ8V9v8v4q
=hfa3
-----END PGP SIGNATURE-----
Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax updates from Dan Williams:
"New device-dax infrastructure to allow persistent memory and other
"reserved" / performance differentiated memories, to be assigned to
the core-mm as "System RAM".
Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D Xpoint, and want to use
typical Linux memory management apis rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much Persistent Memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.
One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.
Summary:
- Replace the /sys/class/dax device model with /sys/bus/dax, and
include a compat driver so distributions can opt-in to the new ABI.
- Allow for an alternative driver for the device-dax address-range
- Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
- Arrange for the device-dax target-node to be onlined so that the
newly added memory range can be uniquely referenced by numa apis"
NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.
And apparently the PMEM modules can cause that a lot more than regular
RAM. The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.
Quoting Dan from another email:
"The exposure can be reduced in the volatile-RAM case by scanning for
and clearing errors before it is onlined as RAM. The userspace tooling
for that can be in place before v5.1-final. There's also runtime
notifications of errors via acpi_nfit_uc_error_notify() from
background scrubbers on the DIMM devices. With that mechanism the
kernel could proactively clear newly discovered poison in the volatile
case, but that would be additional development more suitable for v5.2.
I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages where the persistent case needs active
application coordination"
* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: "Hotplug" persistent memory for use like normal RAM
mm/resource: Let walk_system_ram_range() search child resources
mm/memory-hotplug: Allow memory resources to be children
mm/resource: Move HMM pr_debug() deeper into resource code
mm/resource: Return real error codes from walk failures
device-dax: Add a 'modalias' attribute to DAX 'bus' devices
device-dax: Add a 'target_node' attribute
device-dax: Auto-bind device after successful new_id
acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
device-dax: Add /sys/class/dax backwards compatibility
device-dax: Add support for a dax override driver
device-dax: Move resource pinning+mapping into the common driver
device-dax: Introduce bus + driver model
device-dax: Start defining a dax bus model
device-dax: Remove multi-resource infrastructure
device-dax: Kill dax_region base
device-dax: Kill dax_region ida
Merge several updates to the ARS implementation. Highlights include:
* Support retrieval of short-ARS results if the ARS state is "requires
continuation", and even if the "no_init_ars" module parameter is
specified.
* Allow busy-polling of the kernel ARS state by allowing root to reset
the exponential back-off timer.
* Filter potentially stale ARS results by tracking query-ARS relative to
the previous start-ARS.
Merge miscellaneous libnvdimm sub-system updates for v5.1. Highlights
include:
* Support for the Hyper-V family of device-specific-methods (DSMs)
* Several fixes and workarounds for Hyper-V compatibility.
* Fix for the support to cache the dirty-shutdown-count at init.
ACPI NFIT flags field reports major errors on NVDIMM, which need
user's attention.
Update the current log to a proper error message with dev_err().
The current message string is kept for grep-compatibility.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Gate ARS result consumption on whether the OS issued start-ARS since the
previous consumption. The BIOS may only clear its result buffers after a
successful start-ARS.
Fixes: 0caeef63e6 ("libnvdimm: Add a poison list and export badblocks")
Cc: <stable@vger.kernel.org>
Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The ARS implementation implements exponential back-off on the poll
interval to prevent high-frequency access to the DIMM / platform
interface. Depending on when the ARS completes the poll interval may
exceed the completion event by minutes. Allow root to reset the timeout
each time it probes the status. A one-second timeout is still enforced,
but root can otherwise can control the poll interval.
Fixes: bc6ba80858 ("nfit, address-range-scrub: rework and simplify ARS...")
Cc: <stable@vger.kernel.org>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for introducing new flags to gate whether ARS results are
stale, or poll the completion state, convert the existing flags to an
unsigned long with enumerated values. This conversion allows the flags
to be atomically updated outside of ->init_mutex.
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The ars_start_flags property of 'struct acpi_nfit_desc' is no longer
used since ARS_REQ_SHORT and ARS_REQ_LONG were added.
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The no_init_ars option is meant to prevent long-ARS, but short-ARS
should be allowed to grab any immediate results.
Fixes: bc6ba80858 ("nfit, address-range-scrub: rework and simplify ARS...")
Cc: <stable@vger.kernel.org>
Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If query-ARS reports that ARS has stopped and requires continuation
attempt to retrieve short-ARS results before continuing the long
operation.
Fixes: bc6ba80858 ("nfit, address-range-scrub: rework and simplify ARS...")
Cc: <stable@vger.kernel.org>
Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Recent fixes to command handling enabled Linux to read label
configurations that it could not before. Unfortunately that means that
configurations that were operating in label-less mode will be broken as
the kernel ignores the existing namespace configuration and tries to
honor the new found labels.
Fortunately this seems limited to a case where Linux can quirk the
behavior and maintain the existing label-less semantics by default.
When the platform does not emit an _LSW method, disable all label access
methods. Provide a 'force_labels' module parameter to allow read-only
label operation.
Fixes: 11189c1089 ("acpi/nfit: Fix command-supported detection")
Reported-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit 11189c1089 "acpi/nfit: Fix command-supported detection" broke
ND_CMD_CALL for bus-level commands. The "func = cmd" assumption is only
valid for:
ND_CMD_ARS_CAP
ND_CMD_ARS_START
ND_CMD_ARS_STATUS
ND_CMD_CLEAR_ERROR
The function number otherwise needs to be pulled from the command
payload for:
NFIT_CMD_TRANSLATE_SPA
NFIT_CMD_ARS_INJECT_SET
NFIT_CMD_ARS_INJECT_CLEAR
NFIT_CMD_ARS_INJECT_GET
Update cmd_to_func() for the bus case and call it in the common path.
Fixes: 11189c1089 ("acpi/nfit: Fix command-supported detection")
Cc: <stable@vger.kernel.org>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Grzegorz Burzynski <grzegorz.burzynski@intel.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As Dexuan reports the NVDIMM_FAMILY_HYPERV platform is incompatible with
the existing Linux namespace implementation because it uses
NSLABEL_FLAG_LOCAL for x1-width PMEM interleave sets. Quirk it as an
platform / DIMM that does not provide BLK-aperture access. Allow the
libnvdimm core to assume no potential for aliasing. In case other
implementations make the same mistake, provide a "noblk" module
parameter to force-enable the quirk.
Link: https://lkml.kernel.org/r/PU1P153MB0169977604493B82B662A01CBF920@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM
Reported-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add the Hyper-V _DSM command set to the white list of NVDIMM command
sets.
This command set is documented at http://www.uefi.org/RFIC_LIST
(see "Virtual NVDIMM 0x1901").
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In the case of ND_CMD_CALL, we should also check out_obj->type.
The patch uses out_obj->type, which is a short alias to
out_obj->package.type.
Fixes: 31eca76ba2 ("nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The implementation is broken in all the ways the unit test did not touch:
1/ The local definition of in_buf and in_obj violated C99 initializer
expectations for zeroing. By only initializing 2 out of the three
struct members the compiler was free to zero-initialize the remaining
entry even though the aliased location in the union was initialized.
2/ The implementation made assumptions about the state of the 'smart'
payload after command execution that are satisfied by
acpi_nfit_ctl(), but not acpi_evaluate_dsm().
3/ populate_shutdown_status() is skipped on Intel NVDIMMs due to the early
return for skipping the common _LS{I,R,W} enabling.
4/ The input length should be zero.
This breakage was missed due to the unit test implementation only
testing the case where nfit_intel_shutdown_status() returns a valid
payload.
Much of this complexity would be saved if acpi_nfit_ctl() could be used, but
that currently requires a 'struct nvdimm *' argument and one is not created
until later in the init process. The health result is needed before the device
is created because the payload gates whether the nmemX/nfit/dirty_shutdown
property is visible in sysfs.
Cc: <stable@vger.kernel.org>
Fixes: 0ead11181f ("acpi, nfit: Collect shutdown status")
Reported-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The _DSM function number validation only happens to succeed when the
generic Linux command number translation corresponds with a
DSM-family-specific function number. This breaks NVDIMM-N
implementations that correctly implement _LSR, _LSW, and _LSI, but do
not happen to publish support for DSM function numbers 4, 5, and 6.
Recall that the support for _LS{I,R,W} family of methods results in the
DIMM being marked as supporting those command numbers at
acpi_nfit_register_dimms() time. The DSM function mask is only used for
ND_CMD_CALL support of non-NVDIMM_FAMILY_INTEL devices.
Fixes: 31eca76ba2 ("nfit, libnvdimm: limited/whitelisted dimm command...")
Cc: <stable@vger.kernel.org>
Link: https://github.com/pmem/ndctl/issues/78
Reported-by: Sujith Pandel <sujith_pandel@dell.com>
Tested-by: Sujith Pandel <sujith_pandel@dell.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for using function number 0 as an error value, prevent it
from being considered a valid function value by acpi_nfit_ctl().
Cc: <stable@vger.kernel.org>
Cc: stuart hayes <stuart.w.hayes@gmail.com>
Fixes: e02fb7264d ("nfit: add Microsoft NVDIMM DSM command set...")
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The following warning:
ACPI0012:00: security event setup failed: -19
...is meant to capture exceptional failures of sysfs_get_dirent(),
however it will also fail in the common case when security support is
disabled. A few issues:
1/ A dev_warn() report for a common case is too chatty
2/ The setup of this notifier is generic, no need for it to be driven
from the nfit driver, it can exist completely in the core.
3/ If it fails for any reason besides security support being disabled,
that's fatal and should abort DIMM activation. Userspace may hang if
it never gets overwrite notifications.
4/ The dirent needs to be released.
Move the call to the core 'dimm' driver, make it conditional on security
support being active, make it fatal for the exceptional case, add the
missing sysfs_put() at device disable time.
Fixes: 7d988097c5 ("...Add security DSM overwrite support")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We allocate nd_set in acpi_nfit_init_interleave_set() and assignn it to
ndr_desc, while the assignment is done twice in this function.
This patch removes the first assignment. No functional change.
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Possible race accessing memdev structures after dropping the
mutex. Dan Williams says this could race against another thread
that is doing:
# echo "ACPI0012:00" > /sys/bus/acpi/drivers/nfit/unbind
Reported-by: Jane Chu <jane.chu@oracle.com>
Fixes: 23222f8f8d ("acpi, nfit: Add function to look up nvdimm...")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
On arm64 little endian allyesconfig:
drivers/acpi/nfit/intel.c:149:12: warning: unused function 'intel_security_unlock' [-Wunused-function]
static int intel_security_unlock(struct nvdimm *nvdimm,
^
drivers/acpi/nfit/intel.c:230:12: warning: unused function 'intel_security_erase' [-Wunused-function]
static int intel_security_erase(struct nvdimm *nvdimm,
^
drivers/acpi/nfit/intel.c:279:12: warning: unused function 'intel_security_query_overwrite' [-Wunused-function]
static int intel_security_query_overwrite(struct nvdimm *nvdimm)
^
drivers/acpi/nfit/intel.c:316:12: warning: unused function 'intel_security_overwrite' [-Wunused-function]
static int intel_security_overwrite(struct nvdimm *nvdimm,
^
4 warnings generated.
Mark these functions as __maybe_unused because they are only used when
CONFIG_X86 is set.
Fixes: 4c6926a23b ("acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs")
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The function to_acpi_nfit_desc and function to_acpi_desc
do the same things,delete the function to_acpi_nfit_desc,
and keep the inline function to_acpi_desc.
Signed-off-by: Xiaochun Lee <lixc17@lenovo.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The header file "intel.h" is repeated here, So delete one.
Signed-off-by: Xiaochun Lee <lixc17@lenovo.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. Where "initiator" and
"target" proximity domains is an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to described the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).
Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind process *on* the memory-range directly after the
backing address range is onlined.
Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.
[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
Reported-by: Fan Du <fan.du@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Use common helpers, bitmap_zalloc() and kstrndup(), to replace open
coded versions.
* Clarify the comments around hotplug vs initial init case for the nfit
driver.
* Cleanup the libnvdimm init path.
With Intel DSM 1.8 [1] two new security DSMs are introduced. Enable/update
master passphrase and master secure erase. The master passphrase allows
a secure erase to be performed without the user passphrase that is set on
the NVDIMM. The commands of master_update and master_erase are added to
the sysfs knob in order to initiate the DSMs. They are similar in opeartion
mechanism compare to update and erase.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add support for the NVDIMM_FAMILY_INTEL "ovewrite" capability as
described by the Intel DSM spec v1.7. This will allow triggering of
overwrite on Intel NVDIMMs. The overwrite operation can take tens of
minutes. When the overwrite DSM is issued successfully, the NVDIMMs will
be unaccessible. The kernel will do backoff polling to detect when the
overwrite process is completed. According to the DSM spec v1.7, the 128G
NVDIMMs can take up to 15mins to perform overwrite and larger DIMMs will
take longer.
Given that overwrite puts the DIMM in an indeterminate state until it
completes introduce the NDD_SECURITY_OVERWRITE flag to prevent other
operations from executing when overwrite is happening. The
NDD_WORK_PENDING flag is added to denote that there is a device reference
on the nvdimm device for an async workqueue thread context.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add support to issue a secure erase DSM to the Intel nvdimm. The
required passphrase is acquired from an encrypted key in the kernel user
keyring. To trigger the action, "erase <keyid>" is written to the
"security" sysfs attribute.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add support to disable passphrase (security) for the Intel nvdimm. The
passphrase used for disabling is pulled from an encrypted-key in the kernel
user keyring. The action is triggered by writing "disable <keyid>" to the
sysfs attribute "security".
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add support to unlock the dimm via the kernel key management APIs. The
passphrase is expected to be pulled from userspace through keyutils.
The key management and sysfs attributes are libnvdimm generic.
Encrypted keys are used to protect the nvdimm passphrase at rest. The
master key can be a trusted-key sealed in a TPM, preferred, or an
encrypted-key, more flexible, but more exposure to a potential attacker.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add support for freeze security on Intel nvdimm. This locks out any
changes to security for the DIMM until a hard reset of the DIMM is
performed. This is triggered by writing "freeze" to the generic
nvdimm/nmemX "security" sysfs attribute.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command
set, expose a security capability to lock the DIMMs at poweroff and
require a passphrase to unlock them. The security model is derived from
ATA security. In anticipation of other DIMMs implementing a similar
scheme, and to abstract the core security implementation away from the
device-specific details, introduce nvdimm_security_ops.
Initially only a status retrieval operation, ->state(), is defined,
along with the base infrastructure and definitions for future
operations.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The generated dimm id is needed for the sysfs attribute as well as being
used as the identifier/description for the security key. Since it's
constant and should never change, store it as a member of struct nvdimm.
As nvdimm_create() continues to grow parameters relative to NFIT driver
requirements, do not require other implementations to keep pace.
Introduce __nvdimm_create() to carry the new parameters and keep
nvdimm_create() with the long standing default api.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add detailed explanation for why it's ok to return 0 if we fail to find
an NFIT at startup. Refer to chapter 9.20.2 NVDIMM Root Device in ACPI
6.2 spec.
Signed-off-by: Ocean He <hehy1@lenovo.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A "short" ARS (address range scrub) instructs the platform firmware to
return known errors. In contrast, a "long" ARS instructs platform
firmware to arrange every data address on the DIMM to be read / checked
for poisoned data.
The conversion of the flags in commit d3abaf43ba "acpi, nfit: Fix
Address Range Scrub completion tracking", changed the meaning of passing
'0' to acpi_nfit_ars_rescan(). Previously '0' meant "not short", now '0'
is ARS_REQ_SHORT. Pass ARS_REQ_LONG to restore the expected scrub-type
behavior of user-initiated ARS sessions.
Fixes: d3abaf43ba ("acpi, nfit: Fix Address Range Scrub completion tracking")
Reported-by: Jacek Zloch <jacek.zloch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add command definition for security commands defined in Intel DSM
specification v1.8 [1]. This includes "get security state", "set
passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite",
"overwrite query", "master passphrase enable/disable", and "master
erase", . Since this adds several Intel definitions, move the relevant
bits to their own header.
These commands mutate physical data, but that manipulation is not cache
coherent. The requirement to flush and invalidate caches makes these
commands unsuitable to be called from userspace, so extra logic is added
to detect and block these commands from being submitted via the ioctl
command submission path.
Lastly, the commands may contain sensitive key material that should not
be dumped in a standard debug session. Update the nvdimm-command
payload-dump facility to move security command payloads behind a
default-off compile time switch.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error injection
and platform-BIOS support enabled this corner case to be exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions triggers a
lockdep report. Revert and try again at the next merge window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb8MdUAAoJEB7SkWpmfYgCifYP/A+OQ19HybqcY2nfvqXUdQum
Q5x3qcKmGmEbKbnCUMOHZJpEjW4c/Cpm6OKhuFDQJ4tijn1XG3/ATSi7PXrZxs/o
CRK8MIg5Wz/mMvYRvypIkCHHHr9+Y1NjqmQynM4LLzNG24GMXaeHHuZUTnrZCDmu
0+jBTylNgVYdykoIxgHDYDB+cd6w4NtAP5OD9D46pdsmzX9ac+OQyZMyNB3glUhd
/ZFAoywVNfvfJVWEci9RoHiKttWxgVoCuNbSlCs2Y6ymepA44ApR9AgLHtaC9pFO
DrPkfCzPSmf4PVSxLJd79+/sw9YOcBD7LZ5IxzozxRMuRn5pIofdZIsBg9PlwT5B
NL9jQK87XPiG0vNxhJu3wzP+FlyCXxGxkWfApp7w4rlWBV7RgugOZHyH051rdKzQ
44JAPzLLCfA5Mj4o2tIbSx42f2JNX93XDEX8fkUB+qs3GzyOcMtlcmz9UjmnrT0R
o9KHKhDn81Vivxh33Ts2G0iHktO83XSUBDWApSd6erjEUXMsCLY0D8y+nDGTOMUh
kVcY8q93sgZGLVbcxt0eGc8Q7osZYawQGRGucflTETFcxNwMyLL4F9lWgPirGeYF
i1JDWeTrhcImYufNj8o78LsbT5xh6YjbZZ8Q1obIgPXpDtxHNIXO6COId49Zp2cK
obftWyVp+7kYe79NWzmD
=sfNx
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A small batch of fixes for v4.20-rc3.
The overflow continuation fix addresses something that has been broken
for several releases. Arguably it could wait even longer, but it's a
one line fix and this finishes the last of the known address range
scrub bug reports. The revert addresses a lockdep regression. The unit
tests are not critical to fix, but no reason to hold this fix back.
Summary:
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error
injection and platform-BIOS support enabled this corner case to be
exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions
triggers a lockdep report. Revert and try again at the next merge
window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)"
* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
Revert "acpi, nfit: Further restrict userspace ARS start requests"
acpi, nfit: Fix ARS overflow continuation
tools/testing/nvdimm: Fix the array size for dimm devices.
The following lockdep splat results from acquiring the init_mutex in
acpi_nfit_clear_to_send():
WARNING: possible circular locking dependency detected
lt-daxdev-error/7216 is trying to acquire lock:
00000000f694db15 (&acpi_desc->init_mutex){+.+.}, at: acpi_nfit_clear_to_send+0x27/0x80 [nfit]
but task is already holding lock:
00000000182298f2 (&nvdimm_bus->reconfig_mutex){+.+.}, at: __nd_ioctl+0x457/0x610 [libnvdimm]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&nvdimm_bus->reconfig_mutex){+.+.}:
nvdimm_badblocks_populate+0x41/0x150 [libnvdimm]
nd_region_notify+0x95/0xb0 [libnvdimm]
nd_device_notify+0x40/0x50 [libnvdimm]
ars_complete+0x7f/0xd0 [nfit]
acpi_nfit_scrub+0xbb/0x410 [nfit]
process_one_work+0x22b/0x5c0
worker_thread+0x3c/0x390
kthread+0x11e/0x140
ret_from_fork+0x3a/0x50
-> #0 (&acpi_desc->init_mutex){+.+.}:
__mutex_lock+0x83/0x980
acpi_nfit_clear_to_send+0x27/0x80 [nfit]
__nd_ioctl+0x474/0x610 [libnvdimm]
nd_ioctl+0xa4/0xb0 [libnvdimm]
do_vfs_ioctl+0xa5/0x6e0
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
New infrastructure is needed to be able to perform this check without
acquiring the lock.
Fixes: 594861215c ("acpi, nfit: Further restrict userspace ARS start")
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When the platform BIOS is unable to report all the media error records
it requires the OS to restart the scrub at a prescribed location. The
driver detects the overflow condition, but then fails to report it to
the ARS state machine after reaping the records. Propagate -ENOSPC
correctly to continue the ARS operation.
Cc: <stable@vger.kernel.org>
Fixes: 1cf03c00e7 ("nfit: scrub and register regions in a workqueue")
Reported-by: Jacek Zloch <jacek.zloch@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The NFIT machine check handler uses the physical address from the mce
structure, and compares it against information in the ACPI NFIT table
to determine whether that location lies on an NVDIMM. The mce->addr
field however may not always be valid, and this is indicated by the
MCI_STATUS_ADDRV bit in the status field.
Export mce_usable_address() which already performs validation for the
address, and use it in the NFIT handler.
Fixes: 6839a6d96f ("nfit: do an ARS scrub on hitting a latent media error")
Reported-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Arnd Bergmann <arnd@arndb.de>
Cc: Dan Williams <dan.j.williams@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: elliott@hpe.com
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Len Brown <lenb@kernel.org>
CC: linux-acpi@vger.kernel.org
CC: linux-edac <linux-edac@vger.kernel.org>
CC: linux-nvdimm@lists.01.org
CC: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
CC: "Rafael J. Wysocki" <rjw@rjwysocki.net>
CC: Ross Zwisler <zwisler@kernel.org>
CC: stable <stable@vger.kernel.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tony Luck <tony.luck@intel.com>
CC: x86-ml <x86@kernel.org>
CC: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20181026003729.8420-2-vishal.l.verma@intel.com
The MCE handler for nfit devices is called for memory errors on a
Non-Volatile DIMM and adds the error location to a 'badblocks' list.
This list is used by the various NVDIMM drivers to avoid consuming known
poison locations during IO.
The MCE handler gets called for both corrected and uncorrectable errors.
Until now, both kinds of errors have been added to the badblocks list.
However, corrected memory errors indicate that the problem has already
been fixed by hardware, and the resulting interrupt is merely a
notification to Linux.
As far as future accesses to that location are concerned, it is
perfectly fine to use, and thus doesn't need to be included in the above
badblocks list.
Add a check in the nfit MCE handler to filter out corrected mce events,
and only process uncorrectable errors.
Fixes: 6839a6d96f ("nfit: do an ARS scrub on hitting a latent media error")
Reported-by: Omar Avelar <omar.avelar@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Arnd Bergmann <arnd@arndb.de>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: elliott@hpe.com
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Len Brown <lenb@kernel.org>
CC: linux-acpi@vger.kernel.org
CC: linux-edac <linux-edac@vger.kernel.org>
CC: linux-nvdimm@lists.01.org
CC: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
CC: "Rafael J. Wysocki" <rjw@rjwysocki.net>
CC: Ross Zwisler <zwisler@kernel.org>
CC: stable <stable@vger.kernel.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tony Luck <tony.luck@intel.com>
CC: x86-ml <x86@kernel.org>
CC: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20181026003729.8420-1-vishal.l.verma@intel.com
In addition to not allowing ARS start while the background thread is
actively running, prevent ARS start while any scrub request is pending.
This aligns the window for ARS start submission with the status of ARS
reported via sysfs. Previously userspace could sneak its own ARS start
requests in while sysfs reported -EBUSY.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The Address Range Scrub implementation tried to skip running scrubs
against ranges that were already scrubbed by the BIOS. Unfortunately
that support also resulted in early scrub completions as evidenced by
this debug output from nfit_test:
nd_region region9: ARS: range 1 short complete
nd_region region3: ARS: range 1 short complete
nd_region region4: ARS: range 2 ARS start (0)
nd_region region4: ARS: range 2 short complete
...i.e. completions without any indications that the scrub was started.
This state of affairs was hard to see in the code due to the
proliferation of state bits and mistakenly trying to track done state
per-range when the completion is a global property of the bus.
So, kill the four ARS state bits (ARS_REQ, ARS_REQ_REDO, ARS_DONE, and
ARS_SHORT), and replace them with just 2 request flags ARS_REQ_SHORT and
ARS_REQ_LONG. The implementation will still complete and reap the
results of BIOS initiated ARS, but it will not attempt to use that
information to affect the completion status of scrubbing the ranges from
a Linux perspective.
Instead, try to synchronously run a short ARS per range at init time and
schedule a long scrub in the background. If ARS is busy with an ARS
request, schedule both a short and a long scrub for when ARS returns to
idle. This logic also satisfies the intent of what ARS_REQ_REDO was
trying to achieve. The new rule is that the REQ flag stays set until the
next successful ars_start() for that range.
With the new policy that the REQ flags are not cleared until the next
start, the implementation no longer loses requests as can be seen from
the following log:
nd_region region3: ARS: range 1 ARS start short (0)
nd_region region9: ARS: range 1 ARS start short (0)
nd_region region3: ARS: range 1 complete
nd_region region4: ARS: range 2 ARS start short (0)
nd_region region9: ARS: range 1 complete
nd_region region9: ARS: range 1 ARS start long (0)
nd_region region4: ARS: range 2 complete
nd_region region3: ARS: range 1 ARS start long (0)
nd_region region9: ARS: range 1 complete
nd_region region3: ARS: range 1 complete
nd_region region4: ARS: range 2 ARS start long (0)
nd_region region4: ARS: range 2 complete
...note that the nfit_test emulated driver provides 2 buses, that is why
some of the range indices are duplicated. Notice that each range
now successfully completes a short and long scrub.
Cc: <stable@vger.kernel.org>
Fixes: 14c73f997a ("nfit, address-range-scrub: introduce nfit_spa->ars_state")
Fixes: cc3d3458d4 ("acpi/nfit: queue issuing of ars when an uc error...")
Reported-by: Jacek Zloch <jacek.zloch@intel.com>
Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Allow the unit tests to verify the retrieval of the dirty shutdown
count via smart commands, and allow the driver-load-time retrieval of
the smart health payload to be simulated by nfit_test.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Some NVDIMMs, in addition to providing an indication of whether the
previous shutdown was clean, also provide a running count of lifetime
dirty-shutdown events for the device. In anticipation of this
functionality appearing on more devices arrange for the nfit driver to
retrieve / cache this data at DIMM discovery time, and export it via
sysfs.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for adding a flag to indicate whether a DIMM publishes a
dirty-shutdown count, convert the existing flags to a bit field.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Collection of misc libnvdimm patches for 4.19 submission
* Adding support to read locked nvdimm capacity.
* Change test code to make DSM failure code injection an override.
* Add support for calculate maximum contiguous area for namespace.
* Add support for queueing a short ARS when there is on going ARS for
nvdimm.
* Allow NULL to be passed in to ->direct_access() for kaddr and
pfn params.
* Improve smart injection support for nvdimm emulation testing.
* Fix test code that supports for emulating controller temperature.
* Fix hang on error before devm_memremap_pages()
* Fix a bug that causes user memory corruption when data returned
to user for ars_status.
* Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
=5p2n
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dave Jiang:
"Collection of misc libnvdimm patches for 4.19 submission:
- Adding support to read locked nvdimm capacity.
- Change test code to make DSM failure code injection an override.
- Add support for calculate maximum contiguous area for namespace.
- Add support for queueing a short ARS when there is on going ARS for
nvdimm.
- Allow NULL to be passed in to ->direct_access() for kaddr and pfn
params.
- Improve smart injection support for nvdimm emulation testing.
- Fix test code that supports for emulating controller temperature.
- Fix hang on error before devm_memremap_pages()
- Fix a bug that causes user memory corruption when data returned to
user for ars_status.
- Maintainer updates for Ross Zwisler emails and adding Jan Kara to
fsdax"
* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: fix ars_status output length calculation
device-dax: avoid hang on error before devm_memremap_pages()
tools/testing/nvdimm: improve emulation of smart injection
filesystem-dax: Do not request kaddr and pfn when not required
md/dm-writecache: Don't request pointer dummy_addr when not required
dax/super: Do not request a pointer kaddr when not required
tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
acpi/nfit: queue issuing of ars when an uc error notification comes in
libnvdimm: Export max available extent
libnvdimm: Use max contiguous area for namespace size
MAINTAINERS: Add Jan Kara for filesystem DAX
MAINTAINERS: update Ross Zwisler's email address
tools/testing/nvdimm: Fix support for emulating controller temperature
tools/testing/nvdimm: Make DSM failure code injection an override
acpi, nfit: Prefer _DSM over _LSR for namespace label reads
libnvdimm: Introduce locked DIMM capacity support
When the ACPI UC error notifier gets called and ARS_REQ bit is set
with the passed in flag, we can receive -EBUSY if ARS_REQ bit is already
set for the nfit_spa->ars_state. When that happens, the ARS request is
dropped. That can potentially cause us to miss the unreported errors that
the on going ARS request does not receive. Add an ARS_REQ_REDO state that
will request short ARS upon ARS completion to grab any errors we missed.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
The _LSR method indicates locked status via error-code-3 returned in the
_LSR payload. When any error is returned the payload of _LSR is
truncated to a zero-length buffer.
The _DSM path in comparison allows system software to retrieve the
locked status *and* namespace label area contents.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Incremental patch to fix the unchecked dereference in acpi_nfit_ctl.
Reported by Dan Carpenter:
"acpi/nfit: fix cmd_rc for acpi_nfit_ctl to
always return a value" from Jun 28, 2018, leads to the following
Smatch complaint:
drivers/acpi/nfit/core.c:578 acpi_nfit_ctl()
warn: variable dereferenced before check 'cmd_rc' (see line 411)
drivers/acpi/nfit/core.c
410
411 *cmd_rc = -EINVAL;
^^^^^^^^^^^^^^^^^^
Patch adds unchecked dereference.
Fixes: c1985cefd8 ("acpi/nfit: fix cmd_rc for acpi_nfit_ctl to always return a value")
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
The notification of scrub completion happens within the scrub workqueue.
That can clearly race someone running scrub_show() and work_busy()
before the workqueue has a chance to flush the recently completed work.
Add a flag to reliably indicate the idle vs busy state. Without this
change applications using poll(2) to wait for scrub-completion may
falsely wakeup and read ARS as being busy even though the thread is
going idle and then hang indefinitely.
Fixes: bc6ba80858 ("nfit, address-range-scrub: rework and simplify ARS...")
Cc: <stable@vger.kernel.org>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
cmd_rc is passed in by reference to the acpi_nfit_ctl() function and the
caller expects a value returned. However, when the package is pass through
via the ND_CMD_CALL command, cmd_rc is not touched. Make sure cmd_rc is
always set.
Fixes: aef2533822 ("libnvdimm, nfit: centralize command status translation")
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The "Clear Error Unit" may be smaller than the ECC unit size on some
devices. For example, poison may be tracked at 64-byte alignment even
though the ECC unit is larger. Unless / until the ACPI specification
provides a non-ambiguous way to communicate this property do not expose
this to userspace.
Software that had been using this property must already be prepared for
the case where the property is not provided on older kernels, so it is
safe to remove this attribute.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* A rework of the filesytem-dax implementation provides for detection of
unmap operations (truncate / hole punch) colliding with in-progress
device-DMA. A fix for these collisions remains a work-in-progress
pending resolution of truncate latency and starvation regressions.
* The of_pmem driver expands the users of libnvdimm outside of x86 and
ACPI to describe an implementation of persistent memory on PowerPC with
Open Firmware / Device tree.
* Address Range Scrub (ARS) handling is completely rewritten to account for
the fact that ARS may run for 100s of seconds and there is no platform
defined way to cancel it. ARS will now no longer block namespace
initialization.
* The NVDIMM Namespace Label implementation is updated to handle label
areas as small as 1K, down from 128K.
* Miscellaneous cleanups and updates to unit test infrastructure.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
/NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
9eGJG1H+JspcbtLF9iHn
=acZQ
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This cycle was was not something I ever want to repeat as there were
several late changes that have only now just settled.
Half of the branch up to commit d2c997c0f1 ("fs, dax: use
page->mapping to warn...") have been in -next for several releases.
The of_pmem driver and the address range scrub rework were late
arrivals, and the dax work was scaled back at the last moment.
The of_pmem driver missed a previous merge window due to an oversight.
A sense of obligation to rectify that miss is why it is included for
4.17. It has acks from PowerPC folks. Stephen reported a build failure
that only occurs when merging it with your latest tree, for now I have
fixed that up by disabling modular builds of of_pmem. A test merge
with your tree has received a build success report from the 0day robot
over 156 configs.
An initial version of the ARS rework was submitted before the merge
window. It is self contained to libnvdimm, a net code reduction, and
passing all unit tests.
The filesystem-dax changes are based on the wait_var_event()
functionality from tip/sched/core. However, late review feedback
showed that those changes regressed truncate performance to a large
degree. The branch was rewound to drop the truncate behavior change
and now only includes preparation patches and cleanups (with full acks
and reviews). The finalization of this dax-dma-vs-trnucate work will
need to wait for 4.18.
Summary:
- A rework of the filesytem-dax implementation provides for detection
of unmap operations (truncate / hole punch) colliding with
in-progress device-DMA. A fix for these collisions remains a
work-in-progress pending resolution of truncate latency and
starvation regressions.
- The of_pmem driver expands the users of libnvdimm outside of x86
and ACPI to describe an implementation of persistent memory on
PowerPC with Open Firmware / Device tree.
- Address Range Scrub (ARS) handling is completely rewritten to
account for the fact that ARS may run for 100s of seconds and there
is no platform defined way to cancel it. ARS will now no longer
block namespace initialization.
- The NVDIMM Namespace Label implementation is updated to handle
label areas as small as 1K, down from 128K.
- Miscellaneous cleanups and updates to unit test infrastructure"
* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
libnvdimm, of_pmem: workaround OF_NUMA=n build error
nfit, address-range-scrub: add module option to skip initial ars
nfit, address-range-scrub: rework and simplify ARS state machine
nfit, address-range-scrub: determine one platform max_ars value
powerpc/powernv: Create platform devs for nvdimm buses
doc/devicetree: Persistent memory region bindings
libnvdimm: Add device-tree based driver
libnvdimm: Add of_node to region and bus descriptors
libnvdimm, region: quiet region probe
libnvdimm, namespace: use a safe lookup for dimm device name
libnvdimm, dimm: fix dpa reservation vs uninitialized label area
libnvdimm, testing: update the default smart ctrl_temperature
libnvdimm, testing: Add emulation for smart injection commands
nfit, address-range-scrub: introduce nfit_spa->ars_state
libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
nfit, address-range-scrub: fix scrub in-progress reporting
dax, dm: allow device-mapper to operate without dax support
dax: introduce CONFIG_DAX_DRIVER
fs, dax: use page->mapping to warn if truncate collides with a busy page
ext2, dax: introduce ext2_dax_aops
...
After attempting to quickly retrieve known errors the kernel proceeds to
kick off a long running ARS. Add a module option to disable this
behavior at initialization time, or at new region discovery time.
Otherwise, ARS can be started manually regardless of the state of this
setting.
Co-developed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ARS is an operation that can take 10s to 100s of seconds to find media
errors that should rarely be present. If the platform crashes due to
media errors in persistent memory, the expectation is that the BIOS will
report those known errors in a 'short' ARS request.
A 'short' ARS request asks platform firmware to return an ARS payload
with all known errors, but without issuing a 'long' scrub. At driver
init a short request is issued to all PMEM ranges before registering
regions. Then, in the background, a long ARS is scheduled for each
region.
The ARS implementation is simplified to centralize ARS completion work
in the ars_complete() helper. The timeout is removed since there is no
facility to cancel ARS, and this otherwise arranges for system init to
never be blocked waiting for a 'long' ARS. The ars_state flags are used
to coordinate ARS requests from driver init, ARS requests from
userspace, and ARS requests in response to media error notifications.
Given that there is no notification of ARS completion the implementation
still needs to poll. It backs off exponentially to a maximum poll period
of 30 minutes.
Suggested-by: Toshi Kani <toshi.kani@hpe.com>
Co-developed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
acpi_nfit_query_poison() is awkward in that it requires an nfit_spa
argument in order to determine what max_ars value to use. Instead probe
for the minimum max_ars across all scrub-capable ranges in the system
and drop the nfit_spa argument.
This enables a larger rework / simplification of the ARS state machine
whereby the status can be retrieved once and then iterated over all
address ranges to reap completions.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for re-working the ARS implementation to better handle
short vs long ARS runs, introduce nfit_spa->ars_state. For now this just
replaces the nfit_spa->ars_required bit-field/flag, but going forward it
will be used to track ARS completion and make short vs long ARS
requests.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* misc fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAlrDZ6EACgkQEsHwGGHe
VUpWWA//RG9WqqdXxPa05TxiRFWMiNSH/HSGJQZsc3K7xm2TEGVdDv5thYsxXCaW
kA0oiUJA3zd62HeUIWMeGypm+hLM8T73836aIKcfxMduxQ6DR9nk51IV4RmHsLvT
rd4bUsrjCov6spapqS265LOxZVnpKvxZv0NVTj6pIk0pkeghWMZh63vZtrxVkLsS
6dHsdIDBr0fMhcFBVQ67SEDpMIyMAqziPOP9sv6NNElG4Q93RnZr6KYvQdE32Iev
Z6lIUU/sjVJEuhgp6FjJrhvjc+FmGJGERsFqo3Z8RT9P26Cj3wq0ov8IY/KSBazG
mfJFGbL+O3lppGKjjuAwMGr4R3fZMLQpgLxdYe7VNvOHB9ea7732q9MDGiNkN/u7
z8o+dtvmPWr0Y1ir8te+7LnSQXIAwxMkKP/lMStzzuFUETQDd0RWr7GAKlbrgPOU
+DhiQRPBFS7qpslA7BdqV6Ybr3nhDVZciDYWsvNpDMwdem05TxlFwC5l/BzP9NtM
FmQ3B0rQ5we6gib6UZWUjW3BH9wUgFUpu/WXKJl1unMHwYNe3CvdM8yD5W7mUylh
6SVYAmkZyuJ1Du40+YMeimroe77XV60IqoUFNEzMSFeB4/GxMG6+u2gUqNuTpHpZ
aMmCRj9DxWGxZyDsTc49An0X/JbVyIE/Aq6heMTYScgBvCaD1VE=
=K89j
-----END PGP SIGNATURE-----
Merge tag 'edac_for_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
"Noteworthy is the NVDIMM support:
- NVDIMM support to EDAC (Tony Luck)
- misc fixes"
* tag 'edac_for_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
EDAC, sb_edac: Remove variable length array usage
EDAC, skx_edac: Detect non-volatile DIMMs
firmware, DMI: Add function to look up a handle and return DIMM size
acpi, nfit: Add function to look up nvdimm device and provide SMBIOS handle
EDAC: Add new memory type for non-volatile DIMMs
EDAC: Drop duplicated array of strings for memory type names
EDAC, layerscape: Allow building for LS1021A
There is a small window whereby ARS scan requests can schedule work that
userspace will miss when polling scrub_show. Hold the init_mutex lock
over calls to report the status to close this potential escape. Also,
make sure that requests to cancel the ARS workqueue are treated as an
idle event.
Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Fixes: 37b137ff8c ("nfit, libnvdimm: allow an ARS scrub...")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit 1cf03c00e7 "nfit: scrub and register regions in a workqueue"
mistakenly attempts to register a region per BLK aperture. There is
nothing to register for individual apertures as they belong as a set to
a BLK aperture group that are registered with a corresponding
DIMM-control-region. Filter them for registration to prevent some
needless devm_kzalloc() allocations.
Cc: <stable@vger.kernel.org>
Fixes: 1cf03c00e7 ("nfit: scrub and register regions in a workqueue")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Some BIOSen do not handle 0-byte transfer lengths for the _LSR and _LSW
(label storage read/write) methods. This causes Linux to fallback to the
deprecated _DSM path, or otherwise disable label support.
Introduce acpi_nvdimm_has_method() to detect whether a method is
available rather than calling the method, require _LSI and _LSR to be
paired, and require read support before enabling write support.
Cc: <stable@vger.kernel.org>
Fixes: 4b27db7e26 ("acpi, nfit: add support for the _LS...")
Suggested-by: Erik Schmauss <erik.schmauss@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Per the ACPI specification the only functional purpose for a DIMM
Control Region to be mapped into the system physical address space, from
an OSPM perspective, is to support block-apertures. However, there are
some BIOSen that publish DIMM Control Region SPA entries for pre-boot
environment consumption. Undo the kernel policy of generating disabled
'ndblk' regions when this configuration is detected.
Cc: <stable@vger.kernel.org>
Fixes: 1f7df6f88b ("libnvdimm, nfit: regions (block-data-window...)")
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The persistence domain is a point in the platform where once writes
reach that destination the platform claims it will make them persistent
relative to power loss. In the ACPI NFIT this is currently communicated
as 2 bits in the "NFIT - Platform Capabilities Structure". The bits
comprise a hierarchy, i.e. bit0 "CPU Cache Flush to NVDIMM Durability on
Power Loss Capable" implies bit1 "Memory Controller Flush to NVDIMM
Durability on Power Loss Capable".
Commit 96c3a23905 "libnvdimm: expose platform persistence attr..."
shows the persistence domain as flags, but it's really an enumerated
hierarchy.
Fix this newly introduced user ABI to show the closest available
persistence domain before userspace develops dependencies on seeing, or
needing to develop code to tolerate, the raw NFIT flags communicated
through the libnvdimm-generic region attribute.
Fixes: 96c3a23905 ("libnvdimm: expose platform persistence attr...")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
EDAC driver needs to look up attributes of NVDIMMs provided in SMBIOS.
Provide a function that looks up an acpi_nfit_memory_map from a device
handle (node/socket/mc/channel/dimm) and returns the SMBIOS handle.
Also pass back the "flags" so we can see if the NVDIMM is OK.
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Erik Schmauss <erik.schmauss@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: devel@acpica.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/20180312182430.10335-4-tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Dynamic debug can be instructed to add the function name to the debug
output using the +f switch, so there is no need for the nfit module to
do it again. If a user decides to add the +f switch for nfit's dynamic
debug this results in double prints of the function name like the
following:
[ 2391.935383] acpi_nfit_ctl: nfit ACPI0012:00: acpi_nfit_ctl:nmem8 cmd: 10: func: 1 input length: 0
Thus remove the stray __func__ printing.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A NULL pointer reference kernel bug was observed when
acpi_nfit_add_dimm() called in acpi_nfit_register_dimms() failed. This
error path does not set nfit_mem->nvdimm, but the 2nd
list_for_each_entry() loop in the function assumes it's always set. Add
a check to nfit_mem->nvdimm.
Cc: <stable@vger.kernel.org>
Fixes: ba9c8dd3c2 ("acpi, nfit: add dimm device notification support")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Propagate the ADR attribute flag from the NFIT platform capabilities
sub-table to nd_region.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
In ACPI 6.2a the platform capability structure has been added to the NFIT
tables. That provides software the ability to determine whether a system
supports the auto flushing of CPU caches on power loss. If the capability
is supported, we do not need to do dax_flush(). Plumbing the path to set the
property on per region from the NFIT tables.
This patch depends on the ACPI NFIT 6.2a platform capabilities support code
in include/acpi/actbl1.h.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Integration testing with a BIOS that generates injected health event
notifications fails to communicate those events to userspace. The nfit
driver neglects to link the ACPI DIMM device with the necessary driver
data so acpi_nvdimm_notify() fails this lookup:
nfit_mem = dev_get_drvdata(dev);
if (nfit_mem && nfit_mem->flags_attr)
sysfs_notify_dirent(nfit_mem->flags_attr);
Add the necessary linkage when installing the notification handler and
clean it up when the nfit driver instance is torn down.
Cc: <stable@vger.kernel.org>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Fixes: ba9c8dd3c2 ("acpi, nfit: add dimm device notification support")
Reported-by: Daniel Osawa <daniel.k.osawa@intel.com>
Tested-by: Daniel Osawa <daniel.k.osawa@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may be
required to satisfy a write fault to also be flushed ("on disk") before
the kernel returns to userspace from the fault handler. Effectively
every write-fault that dirties metadata completes an fsync() before
returning from the fault handler. The new MAP_SHARED_VALIDATE mapping
type guarantees that the MAP_SYNC flag is validated as supported by the
filesystem's ->mmap() file operation.
* Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This
enables interoperability with environments that only implement the
standardized methods.
* Add support for the ACPI 6.2 NVDIMM media error injection methods.
* Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch
last shutdown status, firmware update, SMART error injection, and
SMART alarm threshold control.
* Cleanup physical address information disclosures to be root-only.
* Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
* Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
957ac8c421 dax: fix PMD faults on zero-length files
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
a39e596baa xfs: support for synchronous DAX faults
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
7b565c9f96 xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaDfvcAAoJEB7SkWpmfYgCk7sP/2qJhBH+VTTdg2osDnhAdAhI
co/AGEmsHFlUCMBb/Ek7UnMAmhBYiJU2q4ywPsNFBpusXpMlqNy5Iwo7k4/wQHE/
SJcIM0g4zg0ViFuUhwV+C2T0R5UzFR8JLd9EYWj/YS6aJpurtotm5l4UStaM0Hzo
AhxSXJLrBDuqCpbOxbctfiGEmdRL7aRfBEAARTNRKBn/iXxJUcYHlp62rtXQS+t4
I6LC/URCWTNTTMGmzW6TRsgSD9WMfd19xKcGzN3qL6ee0KFccxN4ctFqHA/sFGOh
iYLeR0XJUjJxyp+PkWGteXPVZL0Kj3bD/lSTG+Co5bm/ra8a/sh3TSFfgFyoBZD1
EqMN8Ryf80hGp3FabeH2Iw2SviYPZpHSWgjddjxLD0RA6OmpzINc+Wm8eqApjMME
sbZDTOijiab4QMQ0XamF4GuDHyQtawv5Y/w2Ehhl1tmiqW+5tKhsKqxkQt+/V3Yt
RTVSRe2Pkway66b+cD64IdQ6L2tyonPnmi5IzgkKOhlOEGomy+4/U2Jt2bMbhzq6
ymszKmXp2XI8P06wU8sHrIUeXO5I9qoKn/fZA73Eb8aIzgJe3tBE/5+Ab7RG6HB9
1OVfcMWoXU1gNgNktTs63X1Lsg4aW9kt/K4fPHHcqUcaliEJpJTlAbg9GLF2buoW
nQ+0fTRgMRihE3ZA0Fs3
=h2vZ
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"Save for a few late fixes, all of these commits have shipped in -next
releases since before the merge window opened, and 0day has given a
build success notification.
The ext4 touches came from Jan, and the xfs touches have Darrick's
reviewed-by. An xfstest for the MAP_SYNC feature has been through
a few round of reviews and is on track to be merged.
- Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may
be required to satisfy a write fault to also be flushed ("on disk")
before the kernel returns to userspace from the fault handler.
Effectively every write-fault that dirties metadata completes an
fsync() before returning from the fault handler. The new
MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
is validated as supported by the filesystem's ->mmap() file
operation.
- Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
This enables interoperability with environments that only implement
the standardized methods.
- Add support for the ACPI 6.2 NVDIMM media error injection methods.
- Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
latch last shutdown status, firmware update, SMART error injection,
and SMART alarm threshold control.
- Cleanup physical address information disclosures to be root-only.
- Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
- Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
- 957ac8c421 ("dax: fix PMD faults on zero-length files"):
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
- a39e596baa ("xfs: support for synchronous DAX faults") and
7b565c9f96 ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"
* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
acpi, nfit: add 'Enable Latch System Shutdown Status' command support
dax: fix general protection fault in dax_alloc_inode
dax: fix PMD faults on zero-length files
dax: stop requiring a live device for dax_flush()
brd: remove dax support
dax: quiet bdev_dax_supported()
fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
tools/testing/nvdimm: unit test clear-error commands
acpi, nfit: validate commands against the device type
tools/testing/nvdimm: stricter bounds checking for error injection commands
xfs: support for synchronous DAX faults
xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
ext4: Support for synchronous DAX faults
ext4: Simplify error handling in ext4_dax_huge_fault()
dax: Implement dax_finish_sync_fault()
dax, iomap: Add support for synchronous faults
mm: Define MAP_SYNC and VM_SYNC flags
dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
dax: Allow dax_iomap_fault() to return pfn
dax: Fix comment describing dax_iomap_fault()
...
The NVDIMM_FAMILY_INTEL 'Enable Latch System Shutdown Status' command
indicates to the platform that system software has acknowledged the most
recent unsafe shutdown status.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fix occasions in acpi_nfit_ctl where we check the command type without
validating whether we are parsing dimm vs bus level commands. Where the
command numbers alias between dimms and bus we can make the wrong
assumption just checking the raw command number. For example, with a
simple nfit_test mock up of the clear-error command we trigger the
following:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000094
IP: acpi_nfit_ctl+0x29b/0x930 [nfit]
[..]
Call Trace:
nfit_test_probe+0xb85/0xc09 [nfit_test]
platform_drv_probe+0x3b/0xa0
? platform_drv_probe+0x3b/0xa0
driver_probe_device+0x29c/0x450
? test_alloc+0x180/0x180 [nfit_test]
__driver_attach+0xe3/0xf0
? driver_probe_device+0x450/0x450
bus_for_each_dev+0x73/0xc0
driver_attach+0x1e/0x20
bus_add_driver+0x173/0x270
driver_register+0x60/0xe0
__platform_driver_register+0x36/0x40
nfit_test_init+0x2a1/0x1000 [nfit_test]
Fixes: 4b27db7e26 ("acpi, nfit: add support for the _LSI, _LSR, and...")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nfit_test needs to use the poison list manipulation code as well. Make
it more generic and in the process rename poison to badrange, and move
all the related helpers to a new file.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
[vishal: Add badrange.o to nfit_test's Kbuild]
[vishal: add a missed include in bus.c for the new badrange functions]
[vishal: rename all instances of 'be' to 'bre']
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Per v1.6 of the NVDIMM_FAMILY_INTEL command set [1] some of the new
commands require rev-id 2. In addition to enabling ND_CMD_CALL for these
new function numbers, add a lookup table for revision-ids by family
and function number.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For vendor specific commands that do not have a common kernel
translation, hide them from nmemX/commands. For example, the following
results from new enabling to probe for support of the new
NVDIMM_FAMILY_INTEL DSMs specified in v1.6 of the command specification
[1]:
# cat /sys/bus/nd/devices/nmem0/commands
smart smart_thresh flags get_size get_data set_data effect_size
effect_log vendor cmd_call unknown unknown unknown unknown unknown
unknown unknown unknown
[1]: https://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ACPI 6.2 adds support for named methods to access the label storage area
of an NVDIMM. We prefer these new methods if available and otherwise
fallback to the NVDIMM_FAMILY_INTEL _DSMs. The kernel ioctls,
ND_IOCTL_{GET,SET}_CONFIG_{SIZE,DATA}, remain generic and the driver
translates the 'package' payloads into the NVDIMM_FAMILY_INTEL 'buffer'
format to maintain compatibility with existing userspace and keep the
output buffer parsing code in the driver common.
The output payloads are mostly compatible save for the 'label area
locked' status that moves from the 'config_size' (_LSI) command to the
'config_read' (_LSR) command status.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Though nfit_test need to show what feature is supported via ND_CMD_CALL on
device/nfit/dsm_mask, currently there is no way to tell it.
This patch makes to enable it.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Media error handling support in the Block Translation Table (BTT)
driver is reworked to address sleeping-while-atomic locking and
memory-allocation-context conflicts.
* The dax_device lookup overhead for xfs and ext4 is moved out of the
iomap hot-path to a mount-time lookup.
* A new 'ecc_unit_size' sysfs attribute is added to advertise the
read-modify-write boundary property of a persistent memory range.
* Preparatory fix-ups for arm and powerpc pmem support are included
along with other miscellaneous fixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZtsAGAAoJEB7SkWpmfYgCrzMP/2vPvZvrFjZn5pAoZjlmTmHM
ySceoOC7vwvVXIsSs52FhSjcxEoXo9cklXPwhXOPVtVUFdSDJBUOIUxwIziE6Y+5
sFJ2xT9K+5zKBUiXJwqFQDg52dn//eBNnnnDz+HQrBSzGrbWQhIZY2m19omPzv1I
BeN0OCGOdW3cjSo3BCFl1d+KrSl704e7paeKq/TO3GIiAilIXleTVxcefEEodV2K
ZvWHpFIhHeyN8dsF8teI952KcCT92CT/IaabxQIwCxX0/8/GFeDc5aqf77qiYWKi
uxCeQXdgnaE8EZNWZWGWIWul6eYEkoCNbLeUQ7eJnECq61VxVajJS0NyGa5T9OiM
P046Bo2b1b3R0IHxVIyVG0ZCm3YUMAHSn/3uRxPgESJ4bS/VQ3YP5M6MLxDOlc90
IisLilagitkK6h8/fVuVrwciRNQ71XEC34t6k7GCl/1ZnLlLT+i4/jc5NRZnGEZh
aXAAGHdteQ+/mSz6p2UISFUekbd6LerwzKRw8ibDvH6pTud8orYR7g2+JoGhgb6Y
pyFVE8DhIcqNKAMxBsjiRZ46OQ7qrT+AemdAG3aVv6FaNoe4o5jPLdw2cEtLqtpk
+DNm0/lSWxxxozjrvu6EUZj6hk8R5E19XpRzV5QJkcKUXMu7oSrFLdMcC4FeIjl9
K4hXLV3fVBVRMiS0RA6z
=5iGY
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm from Dan Williams:
"A rework of media error handling in the BTT driver and other updates.
It has appeared in a few -next releases and collected some late-
breaking build-error and warning fixups as a result.
Summary:
- Media error handling support in the Block Translation Table (BTT)
driver is reworked to address sleeping-while-atomic locking and
memory-allocation-context conflicts.
- The dax_device lookup overhead for xfs and ext4 is moved out of the
iomap hot-path to a mount-time lookup.
- A new 'ecc_unit_size' sysfs attribute is added to advertise the
read-modify-write boundary property of a persistent memory range.
- Preparatory fix-ups for arm and powerpc pmem support are included
along with other miscellaneous fixes"
* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
libnvdimm, btt: fix format string warnings
libnvdimm, btt: clean up warning and error messages
ext4: fix null pointer dereference on sbi
libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
dax: fix FS_DAX=n BLOCK=y compilation
libnvdimm: fix integer overflow static analysis warning
libnvdimm, nd_blk: remove mmio_flush_range()
libnvdimm, btt: rework error clearing
libnvdimm: fix potential deadlock while clearing errors
libnvdimm, btt: cache sector_size in arena_info
libnvdimm, btt: ensure that flags were also unchanged during a map_read
libnvdimm, btt: refactor map entry operations with macros
libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
ext4: perform dax_device lookup at mount
ext2: perform dax_device lookup at mount
xfs: perform dax_device lookup at mount
dax: introduce a fs_dax_get_by_bdev() helper
libnvdimm, btt: check memory allocation failure
libnvdimm, label: fix index block size calculation
...
Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl)
that uses it, as a prevention of a potential double-fetch bug.
While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race condition bug) where
the same userspace memory region are fetched twice into kernel with sanity
checks after the first fetch while missing checks after the second fetch.
In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:
1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg)
2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
(line 984 to 986).
3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len)
4. Given that `p` can be fully controlled in userspace, an attacker can
race condition to override the header part of `p`, say,
`((struct nd_cmd_pkg *)p)->nd_reserved2` to arbitrary value
(say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the
second fetch. The changed value will be copied to `buf`.
5. There is no checks on the second fetches until the use of it in
line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc)
which means that the assumed relation, `p->nd_reserved2` are all zeroes might
not hold after the second fetch. And once the control goes to these functions
we lose the context to assert the assumed relation.
6. Based on my manual analysis, `p->nd_reserved2` is not used in function
`nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl`
so there is no working exploit against it right now. However, this could
easily turns to an exploitable one if careless developers start to use
`p->nd_reserved2` later and assume that they are all zeroes.
Move the validation of the nd_reserved2 field to the ->ndctl()
implementation where it has a stable buffer to evaluate.
Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
mmio_flush_range() suffers from a lack of clearly-defined semantics,
and is somewhat ambiguous to port to other architectures where the
scope of the writeback implied by "flush" and ordering might matter,
but MMIO would tend to imply non-cacheable anyway. Per the rationale
in 67a3e8fe90 ("nd_blk: change aperture mapping from WC to WB"), the
only existing use is actually to invalidate clean cache lines for
ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
cleanup of the pmem API, that also now happens to be the exact purpose
of arch_invalidate_pmem(), which would be a far more well-defined tool
for the job.
Rather than risk potentially inconsistent implementations of
mmio_flush_range() for the sake of one callsite, streamline things by
removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
definitions up to the libnvdimm level, so they can be shared by NFIT
as well. This allows NFIT to be enabled for arm64.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When the nfit driver initializes it runs an ARS (Address Range Scrub)
operation across every pmem range. Part of that process involves
determining the ARS capabilities of a given address range. One of the
capabilities that is reported is the 'Clear Uncorrectable Error Range
Length Unit Size' (see: ACPI 6.2 section 9.20.7.4 Function Index 1 -
Query ARS Capabilities). This property is of interest to userspace
software as it indicates the boundary at which the NVDIMM may need to
perform read-modify-write cycles to maintain ECC blocks.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
COMPLETION_INITIALIZER_ONSTACK() is supposed to be used as an initializer,
in other words, it should only be used in assignment expressions or
compound literals. So the usage in drivers/acpi/nfit/core.c:
COMPLETION_INITIALIZER_ONSTACK(flush.cmp);
... is inappropriate.
Besides, this usage could also break the build for another fix that
reduces stack sizes caused by COMPLETION_INITIALIZER_ONSTACK(), because
that fix changes COMPLETION_INITIALIZER_ONSTACK() from rvalue to lvalue,
and usage as above will report the following error:
drivers/acpi/nfit/core.c: In function 'acpi_nfit_flush_probe':
include/linux/completion.h:77:3: error: value computed is not used [-Werror=unused-value]
(*({ init_completion(&work); &work; }))
This patch fixes this by replacing COMPLETION_INITIALIZER_ONSTACK()
with init_completion() in acpi_nfit_flush_probe(), which does the
same initialization without any other problems.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/20170824142239.15178-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is useful to be able to know the position of a DIMM in an
interleave-set. Consider the case where the order of the DIMMs changes
causing a namespace to be invalidated because the interleave-set cookie no
longer matches. If the before and after state of each DIMM position is
known this state debugged by the system owner.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nfit_init() calls nfit_mce_register() on module load. When the module
load fails the nfit mce decoder is not unregistered. The module's
memory is freed leaving the decoder chain referencing junk. This will
cause panics as future registrations will reference the free'd memory.
Unregister the nfit mce decoder on module init failure.
[v2]: register and then unregister mce handler to avoid losing mce events
[v3]: also cleanup nfit workqueue
Fixes: 6839a6d96f ("nfit: do an ARS scrub on hitting a latent media error")
Cc: <stable@vger.kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: "Lee, Chun-Yi" <joeyli.kernel@gmail.com>
Cc: Linda Knippers <linda.knippers@hpe.com>
Cc: lszubowi@redhat.com
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>