QUEUE_FLAG_DAX is an indication that a given block device supports
filesystem DAX and should not be set for PMEM namespaces which are in "raw"
mode. These namespaces lack struct page and are prevented from
participating in filesystem DAX as of commit 569d0365f5 ("dax: require
'struct page' by default for filesystem dax").
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 569d0365f5 ("dax: require 'struct page' by default for filesystem dax")
Cc: stable@vger.kernel.org
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
write to each of the flush hints for a region) in response to an
msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
we were setting up the request queue. This happens due to the write cache
value passed in to blk_queue_write_cache(), which then causes the block
layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set. We do have a
"write_cache" sysfs entry for namespaces, i.e.:
/sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache
which can be used to control whether or not the kernel thinks a given
namespace has a write cache, but this didn't modify the deep flush behavior
that we set up when the driver was initialized. Instead, it only modified
whether or not DAX would flush CPU caches via dax_flush() in response to
*sync calls.
Simplify this by making the *sync deep flush always happen, regardless of
the write cache setting of a namespace. The DAX CPU cache flushing will
still be controlled the write_cache setting of the namespace.
Cc: <stable@vger.kernel.org>
Fixes: 5fdf8e5ba5 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Complete the move from REQ_FLUSH to REQ_PREFLUSH that apparently started
way back in v4.8.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Use the machine check safe version of copy_to_iter() for the
->copy_to_iter() operation published by the pmem driver.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Similar to the ->copy_from_iter() operation, a platform may want to
deploy an architecture or device specific routine for handling reads
from a dax_device like /dev/pmemX. On x86 this routine will point to a
machine check safe version of copy_to_iter(). For now, add the plumbing
to device-mapper and the dax core.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_page_free() as common indicator / infrastructure for dax
filesytems to require. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.
The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.
Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Machine check safe memory copies are currently deployed in the pmem
driver whenever reading from persistent memory media, so that -EIO is
returned rather than triggering a kernel panic. While this protects most
pmem accesses, it is not complete in the filesystem-dax case. When
filesystem-dax is enabled reads may bypass the block layer and the
driver via dax_iomap_actor() and its usage of copy_to_iter().
In preparation for creating a copy_to_iter() variant that can handle
machine checks, teach memcpy_mcsafe() to return the number of bytes
remaining rather than -EFAULT when an exception occurs.
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: hch@lst.de
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/152539238119.31796.14318473522414462886.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* A rework of the filesytem-dax implementation provides for detection of
unmap operations (truncate / hole punch) colliding with in-progress
device-DMA. A fix for these collisions remains a work-in-progress
pending resolution of truncate latency and starvation regressions.
* The of_pmem driver expands the users of libnvdimm outside of x86 and
ACPI to describe an implementation of persistent memory on PowerPC with
Open Firmware / Device tree.
* Address Range Scrub (ARS) handling is completely rewritten to account for
the fact that ARS may run for 100s of seconds and there is no platform
defined way to cancel it. ARS will now no longer block namespace
initialization.
* The NVDIMM Namespace Label implementation is updated to handle label
areas as small as 1K, down from 128K.
* Miscellaneous cleanups and updates to unit test infrastructure.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
/NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
9eGJG1H+JspcbtLF9iHn
=acZQ
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This cycle was was not something I ever want to repeat as there were
several late changes that have only now just settled.
Half of the branch up to commit d2c997c0f1 ("fs, dax: use
page->mapping to warn...") have been in -next for several releases.
The of_pmem driver and the address range scrub rework were late
arrivals, and the dax work was scaled back at the last moment.
The of_pmem driver missed a previous merge window due to an oversight.
A sense of obligation to rectify that miss is why it is included for
4.17. It has acks from PowerPC folks. Stephen reported a build failure
that only occurs when merging it with your latest tree, for now I have
fixed that up by disabling modular builds of of_pmem. A test merge
with your tree has received a build success report from the 0day robot
over 156 configs.
An initial version of the ARS rework was submitted before the merge
window. It is self contained to libnvdimm, a net code reduction, and
passing all unit tests.
The filesystem-dax changes are based on the wait_var_event()
functionality from tip/sched/core. However, late review feedback
showed that those changes regressed truncate performance to a large
degree. The branch was rewound to drop the truncate behavior change
and now only includes preparation patches and cleanups (with full acks
and reviews). The finalization of this dax-dma-vs-trnucate work will
need to wait for 4.18.
Summary:
- A rework of the filesytem-dax implementation provides for detection
of unmap operations (truncate / hole punch) colliding with
in-progress device-DMA. A fix for these collisions remains a
work-in-progress pending resolution of truncate latency and
starvation regressions.
- The of_pmem driver expands the users of libnvdimm outside of x86
and ACPI to describe an implementation of persistent memory on
PowerPC with Open Firmware / Device tree.
- Address Range Scrub (ARS) handling is completely rewritten to
account for the fact that ARS may run for 100s of seconds and there
is no platform defined way to cancel it. ARS will now no longer
block namespace initialization.
- The NVDIMM Namespace Label implementation is updated to handle
label areas as small as 1K, down from 128K.
- Miscellaneous cleanups and updates to unit test infrastructure"
* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
libnvdimm, of_pmem: workaround OF_NUMA=n build error
nfit, address-range-scrub: add module option to skip initial ars
nfit, address-range-scrub: rework and simplify ARS state machine
nfit, address-range-scrub: determine one platform max_ars value
powerpc/powernv: Create platform devs for nvdimm buses
doc/devicetree: Persistent memory region bindings
libnvdimm: Add device-tree based driver
libnvdimm: Add of_node to region and bus descriptors
libnvdimm, region: quiet region probe
libnvdimm, namespace: use a safe lookup for dimm device name
libnvdimm, dimm: fix dpa reservation vs uninitialized label area
libnvdimm, testing: update the default smart ctrl_temperature
libnvdimm, testing: Add emulation for smart injection commands
nfit, address-range-scrub: introduce nfit_spa->ars_state
libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
nfit, address-range-scrub: fix scrub in-progress reporting
dax, dm: allow device-mapper to operate without dax support
dax: introduce CONFIG_DAX_DRIVER
fs, dax: use page->mapping to warn if truncate collides with a busy page
ext2, dax: introduce ext2_dax_aops
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
+Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
5J1axMgVH5LAsIEcEQVq
=RVhK
-----END PGP SIGNATURE-----
Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"It's a pretty quiet round this time, which is nice. This contains:
- series from Bart, cleaning up the way we set/test/clear atomic
queue flags.
- series from Bart, fixing races between gendisk and queue
registration and removal.
- set of bcache fixes and improvements from various folks, by way of
Michael Lyle.
- set of lightnvm updates from Matias, most of it being the 1.2 to
2.0 transition.
- removal of unused DIO flags from Nikolay.
- blk-mq/sbitmap memory ordering fixes from Omar.
- divide-by-zero fix for BFQ from Paolo.
- minor documentation patches from Randy.
- timeout fix from Tejun.
- Alpha "can't write a char atomically" fix from Mikulas.
- set of NVMe fixes by way of Keith.
- bsg and bsg-lib improvements from Christoph.
- a few sed-opal fixes from Jonas.
- cdrom check-disk-change deadlock fix from Maurizio.
- various little fixes, comment fixes, etc from various folks"
* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
blk-mq: Directly schedule q->timeout_work when aborting a request
blktrace: fix comment in blktrace_api.h
lightnvm: remove function name in strings
lightnvm: pblk: remove some unnecessary NULL checks
lightnvm: pblk: don't recover unwritten lines
lightnvm: pblk: implement 2.0 support
lightnvm: pblk: implement get log report chunk
lightnvm: pblk: rename ppaf* to addrf*
lightnvm: pblk: check for supported version
lightnvm: implement get log report chunk helpers
lightnvm: make address conversions depend on generic device
lightnvm: add support for 2.0 address format
lightnvm: normalize geometry nomenclature
lightnvm: complete geo structure with maxoc*
lightnvm: add shorten OCSSD version in geo
lightnvm: add minor version to generic geometry
lightnvm: simplify geometry structure
lightnvm: pblk: refactor init/exit sequences
lightnvm: Avoid validation of default op value
lightnvm: centralize permission check for lightnvm ioctl
...
Use module_nd_driver() instead of having module_init() and
module_exit() callbacks which just call nd_driver_register() and
nd_driver_unregister().
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch has been generated as follows:
for verb in set_unlocked clear_unlocked set clear; do
replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
$(git grep -lw queue_flag_${verb} drivers block/bsg*)
done
Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dynamic debug can be instructed to add the function name to the debug
output using the +f switch, so there is no need for the libnvdimm
modules to do it again. If a user decides to add the +f switch for
libnvdimm's dynamic debug this results in double prints of the function
name.
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Re-enable deep flush so that users always have a way to be sure that a
write makes it all the way out to media. Writes from the PMEM driver
always arrive at the NVDIMM since movnt is used to bypass the cache, and
the driver relies on the ADR (Asynchronous DRAM Refresh) mechanism to
flush write buffers on power failure. The Deep Flush mechanism is there
to explicitly write buffers to protect against (rare) ADR failure. This
change prevents a regression in deep flush behavior so that applications
can continue to depend on fsync() as a mechanism to trigger deep flush
in the filesystem-DAX case.
Fixes: 06e8ccdab1 ("acpi: nfit: Add support for detect platform CPU cache...")
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ACPI 6.2a the platform capability structure has been added to the NFIT
tables. That provides software the ability to determine whether a system
supports the auto flushing of CPU caches on power loss. If the capability
is supported, we do not need to do dax_flush(). Plumbing the path to set the
property on per region from the NFIT tables.
This patch depends on the ACPI NFIT 6.2a platform capabilities support code
in include/acpi/actbl1.h.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
This new interface is similar to how struct device (and many others)
work. The caller initializes a 'struct dev_pagemap' as required
and calls 'devm_memremap_pages'. This allows the pagemap structure to
be embedded in another structure and thus container_of can be used. In
this way application specific members can be stored in a containing
struct.
This will be used by the P2P infrastructure and HMM could probably
be cleaned up to use it as well (instead of having it's own, similar
'hmm_devmem_pages_create' function).
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As discussed at
https://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
someday we will remove rw_page(). If so, we need something to detect
such super-fast storage on which synchronous IO operations like the
current rw_page are always a win.
Introduces BDI_CAP_SYNCHRONOUS_IO to indicate such devices. With it, we
could use various optimization techniques.
Link: http://lkml.kernel.org/r/1505886205-9671-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush abstraction
that was stood up for DM's use.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZuo8UAAoJEMUj8QotnQNa5BEIANO4mHh1nrzEbH72a4RCLgxV
H1Pk1zZx/W1bhOOmcRRhxCSM85dPgsCegc5EmpwLZEMavQrP9UZblHcYOUsyIx7W
S/lWa+soOq/5N2OveROc4WdoWVs50UFmc1+BcClc4YrEe+15XC3R0VMkjX2b/hUL
o2eYhPjpMlgaorMtRRU6MAooo2fBRQ9m05aPeVgd35fxibrE7PZm+EYW09wa0STi
9ufuDXJf8+TtFP/38BD41LbUEskuHUZTSDeAJ+3DBaTtfEZcZYxsst4P9JangsHx
jqqqI9aYzFD2a27fl9WLhCvm40YFiKp5nwzED0RZjzWxVa/jTShX7a49BdzTTfw=
=rkSB
-----END PGP SIGNATURE-----
Merge tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Some request-based DM core and DM multipath fixes and cleanups
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM
integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush
abstraction that was stood up for DM's use.
* tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dax: remove the pmem_dax_ops->flush abstraction
dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
dm integrity: make blk_integrity_profile structure const
dm integrity: do not check integrity for failed read operations
dm log writes: fix >512b sectorsize support
dm log writes: don't use all the cpu while waiting to log blocks
dm ioctl: constify ioctl lookup table
dm: constify argument arrays
dm integrity: count and display checksum failures
dm integrity: optimize writing dm-bufio buffers that are partially changed
dm rq: do not update rq partially in each ending bio
dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
dm mpath: complain about unsupported __multipath_map_bio() return values
dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
Commit abebfbe2f7 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.
It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all, we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().
It should be also pointed out that for some uses of persistent memory it
is needed to flush only a very small amount of data (such as 1 cacheline),
and it would be overkill if we go through that device mapper machinery for
a single flushed cache line.
Fix this by removing the pmem_dax_ops->flush abstraction and call
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.
Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device. To support the THP (Transparent Huge
Page) swap optimization, the .rw_page is enhanced to support to
read/write THP if possible.
Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Introduce the _flushcache() family of memory copy helpers and use them
for persistent memory write operations on x86. The _flushcache()
semantic indicates that the cache is either bypassed for the copy
operation (movnt) or any lines dirtied by the copy operation are
written back (clwb, clflushopt, or clflush).
* Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
* Add support for the new NVDIMM platform/firmware mechanisms introduced
in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace
label format, extensions to the address-range-scrub command set, new
error injection commands, and a new BTT (block-translation-table)
layout. These updates support inter-OS and pre-OS compatibility.
* Fix a longstanding memory corruption bug in nfit_test.
* Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
* Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed:
commit 6aa734a2f3 "libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime"
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e
mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X
rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7
5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT
WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks
rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ
btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i
2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6
MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7
JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg
BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ
pMHuMAqIcIyLhIe2/sRF
=xKQB
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"libnvdimm updates for the latest ACPI and UEFI specifications. This
pull request also includes new 'struct dax_operations' enabling to
undo the abuse of copy_user_nocache() for copy operations to pmem.
The dax work originally missed 4.12 to address concerns raised by Al.
Summary:
- Introduce the _flushcache() family of memory copy helpers and use
them for persistent memory write operations on x86. The
_flushcache() semantic indicates that the cache is either bypassed
for the copy operation (movnt) or any lines dirtied by the copy
operation are written back (clwb, clflushopt, or clflush).
- Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
- Add support for the new NVDIMM platform/firmware mechanisms
introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
namespace label format, extensions to the address-range-scrub
command set, new error injection commands, and a new BTT
(block-translation-table) layout. These updates support inter-OS
and pre-OS compatibility.
- Fix a longstanding memory corruption bug in nfit_test.
- Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
- Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed: commit
6aa734a2f3 ("libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
<toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
libnvdimm, namespace: record 'lbasize' for pmem namespaces
acpi/nfit: Issue Start ARS to retrieve existing records
libnvdimm: New ACPI 6.2 DSM functions
acpi, nfit: Show bus_dsm_mask in sysfs
libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
acpi, nfit: Enable DSM pass thru for root functions.
libnvdimm: passthru functions clear to send
libnvdimm, btt: convert some info messages to warn/err
libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
libnvdimm: fix the clear-error check in nsio_rw_bytes
libnvdimm, btt: fix btt_rw_page not returning errors
acpi, nfit: quiet invalid block-aperture-region warnings
libnvdimm, btt: BTT updates for UEFI 2.7 format
acpi, nfit: constify *_attribute_group
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem, dax: export a cache control attribute
dax: convert to bitmask for flags
dax: remove default copy_from_iter fallback
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: fix persistence warning
...
We need to hold a reference on the 'dirent' until we are sure there are
no more notifications that will be sent. As noted in the new comments we
take advantage of the fact that the references are taken and dropped
under device_lock() and that nd_device_notify() holds device_lock() over
new badblocks notifications. The notifications that happen when
badblocks are cleared only occur while the device is active.
Also take the opportunity to fix up the error messages to report the
user visible effect of a sysfs_get_dirent() failure.
Fixes: 975750a98c ("libnvdimm, pmem: Add sysfs notifications to badblocks")
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set dax_write_cache() if it is.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The dax_flush() operation can be turned into a nop on platforms where
firmware arranges for cpu caches to be flushed on a power-fail event.
The ACPI 6.2 specification defines a mechanism for the platform to
indicate this capability so the kernel can select the proper default.
However, for other platforms, the administrator must toggle this setting
manually.
Given this flush setting is a dax-specific mechanism we advertise it
through a 'dax' attribute group hanging off a host device. For example,
a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a
'write_cache' attribute to control response to dax cache flush requests.
This is similar to the 'queue/write_cache' attribute that appears under
block devices.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that all callers of the pmem api have been converted to dax helpers that
call back to the pmem driver, we can remove include/linux/pmem.h and
asm/pmem.h.
Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Kill this globally defined wrapper and move to libnvdimm so that we can
ultimately remove include/linux/pmem.h and asm/pmem.h.
Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With all handling of the CONFIG_ARCH_HAS_PMEM_API case being moved to
libnvdimm and the pmem driver directly we do not need to provide global
wrappers and fallbacks in the CONFIG_ARCH_HAS_PMEM_API=n case. The pmem
driver will simply not link to arch_wb_cache_pmem() in that case. Same
as before, pmem flushing is only defined for x86_64, via
clean_cache_range(), but it is straightforward to add other archs in the
future.
arch_wb_cache_pmem() is an exported function since the pmem module needs
to find it, but it is privately declared in drivers/nvdimm/pmem.h because
there are no consumers outside of the pmem driver.
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.
An example for this differentiation might be a volatile ram disk where
there is no expectation of persistence. In fact the pmem driver itself might
front such an address range specified by the NFIT. So, this "no flush"
property might be something passed down by the bus / libnvdimm.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Sysfs "badblocks" information may be updated during run-time that:
- MCE, SCI, and sysfs "scrub" may add new bad blocks
- Writes and ioctl() may clear bad blocks
Add support to send sysfs notifications to sysfs "badblocks" file
under region and pmem directories when their badblocks information
is re-evaluated (but is not necessarily changed) during run-time.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Previously we only honored the lba size for blk-aperture mode
namespaces. For pmem namespaces the lba size was just assumed to be 512.
With the new v1.2 label definition and compatibility with other
operating environments, the ->lbasize property is now respected for pmem
namespaces.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes are not
cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
(non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
to ensure data-writes have reached a power-fail-safe zone in the platform. The
fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
around and fence previous writes with an "sfence".
Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
memcpy_flushcache, that guarantee that the destination buffer is not dirty in
the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
will be used to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.
This is meant to satisfy the concern from Linus that if a driver wants to do
something beyond the normal nocache semantics it should be something private to
that driver [1], and Al's concern that anything uaccess related belongs with
the rest of the uaccess code [2].
The first consumer of this interface is a new 'copy_from_iter' dax operation so
that pmem can inject cache maintenance operations without imposing this
overhead on other dax-capable drivers.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
* Region media error reporting: A libnvdimm region device is the parent
to one or more namespaces. To date, media errors have been reported via
the "badblocks" attribute attached to pmem block devices for namespaces
in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
or "btt-sector" mode this new interface reports media errors
generically, i.e. independent of namespace modes or state. This
subsequently allows userspace tooling to craft "ACPI 6.1 Section
9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
submit them via the ioctl path for NVDIMM root bus devices.
* Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
a request from Linus and feedback from Christoph this allows for dax
capable drivers to publish their own custom dax operations. This fixes
the broken assumption that all dax operations are related to a
persistent memory device, and makes it easier for other architectures
and platforms to add customized persistent memory support.
* 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger memory
controllers to drain write-pending buffers that would otherwise be
flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
mechanism at a power loss event. Support for "locked" DIMMs is included
to prevent namespaces from surfacing when the namespace label data area
is locked. Finally, fixes for various reported deadlocks and crashes,
also tagged for -stable.
* ACPI / nfit driver updates: General updates of the nfit driver to add
DSM command overrides, ACPI 6.1 health state flags support, DSM payload
debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
Tested-by: Yi Zhang <yizhan@redhat.com>
commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
TmJQ3Ly9t3o3/weHSzmn
=Ox0s
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.
Change summary:
- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.
This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.
- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.
- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.
- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
- commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang <yizhan@redhat.com>
- commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...
Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:
- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)
- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)
- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)
- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)
- ... plus misc updates, fixes and cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.
The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)
But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.
One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)
So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:
Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.
This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.
This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following BUG was observed when nd_pmem_notify() was called
for a BTT device. The use of a pmem_device pointer is not valid
with BTT.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
Call Trace:
nd_device_notify+0x40/0x50
child_notify+0x10/0x20
device_for_each_child+0x50/0x90
nd_region_notify+0x20/0x30
nd_device_notify+0x40/0x50
nvdimm_region_notify+0x27/0x30
acpi_nfit_scrub+0x341/0x590 [nfit]
process_one_work+0x197/0x450
worker_thread+0x4e/0x4a0
kthread+0x109/0x140
Fix nd_pmem_notify() by setting nd_region and badblocks pointers
properly for BTT.
Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Fixes: 719994660c ("libnvdimm: async notification support")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of
better name, just use memcpy_mcsafe() directly.
This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The read_pmem() function uses memcpy_mcsafe() on x86 where an EFAULT
error code indicates a failed read. Block I/O should use EIO to
indicate failure. Other pmem code paths (like bad blocks) already use
EIO so let's be consistent.
This fixes compatibility with consumers like btrfs that try to parse the
specific error code rather than treat all errors the same.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Colin, via static analysis, reports that the length could be negative
from nvdimm_clear_poison() in the error case. There was a similar
problem with commit 0a3f27b9a6 "libnvdimm, namespace: avoid multiple
sector calculations" that I noticed when merging the for-4.10/libnvdimm
topic branch into libnvdimm-for-next, but I missed this one. Fix both of
them to the following procedure:
* if we clear a block's worth of media, clear that many blocks in
badblocks
* if we clear less than the requested size of the transfer return an
error
* always invalidate cache after any non-error / non-zero
nvdimm_clear_poison result
Fixes: 82bf1037f2 ("libnvdimm: check and clear poison before writing to pmem")
Fixes: 0a3f27b9a6 ("libnvdimm, namespace: avoid multiple sector calculations")
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Dave Jiang <dave.jiang@intel.com>
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Use sector_t for cleared
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Here is an example /proc/iomem listing for a system with 2 namespaces,
one in "sector" mode and one in "memory" mode:
1fc000000-2fbffffff : Persistent Memory (legacy)
1fc000000-2fbffffff : namespace1.0
340000000-34fffffff : Persistent Memory
340000000-34fffffff : btt0.1
Here is the corresponding ndctl listing:
# ndctl list
[
{
"dev":"namespace1.0",
"mode":"memory",
"size":4294967296,
"blockdev":"pmem1"
},
{
"dev":"namespace0.0",
"mode":"sector",
"size":267091968,
"uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
"sector_size":4096,
"blockdev":"pmem0s"
}
]
Notice that the ndctl listing is purely in terms of namespace devices,
while the iomem listing leaks the internal "btt0.1" implementation
detail. Given that ndctl requires the namespace device name to change
the mode, for example:
# ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force
...use the namespace name in the iomem listing to keep the claiming
device name consistent across different mode settings.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>