HMM patches for 5.3
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
commit fec88ab0af
@@ -10,7 +10,7 @@ of this being specialized struct page for such memory (see sections 5 to 7 of
 this document).
 
 HMM also provides optional helpers for SVM (Share Virtual Memory), i.e.,
-allowing a device to transparently access program address coherently with
+allowing a device to transparently access program addresses coherently with
 the CPU meaning that any valid pointer on the CPU is also a valid pointer
 for the device. This is becoming mandatory to simplify the use of advanced
 heterogeneous computing where GPU, DSP, or FPGA are used to perform various
@@ -22,8 +22,8 @@ expose the hardware limitations that are inherent to many platforms. The third
 section gives an overview of the HMM design. The fourth section explains how
 CPU page-table mirroring works and the purpose of HMM in this context. The
 fifth section deals with how device memory is represented inside the kernel.
-Finally, the last section presents a new migration helper that allows lever-
-aging the device DMA engine.
+Finally, the last section presents a new migration helper that allows
+leveraging the device DMA engine.
 
 .. contents:: :local:
 
@@ -39,20 +39,20 @@ address space. I use shared address space to refer to the opposite situation:
 i.e., one in which any application memory region can be used by a device
 transparently.
 
-Split address space happens because device can only access memory allocated
-through device specific API. This implies that all memory objects in a program
+Split address space happens because devices can only access memory allocated
+through a device specific API. This implies that all memory objects in a program
 are not equal from the device point of view which complicates large programs
 that rely on a wide set of libraries.
 
-Concretely this means that code that wants to leverage devices like GPUs needs
-to copy object between generically allocated memory (malloc, mmap private, mmap
+Concretely, this means that code that wants to leverage devices like GPUs needs
+to copy objects between generically allocated memory (malloc, mmap private, mmap
 share) and memory allocated through the device driver API (this still ends up
 with an mmap but of the device file).
 
 For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
-complex data sets (list, tree, ...) are hard to get right. Duplicating a
+for complex data sets (list, tree, ...) it's hard to get right. Duplicating a
 complex data set needs to re-map all the pointer relations between each of its
-elements. This is error prone and program gets harder to debug because of the
+elements. This is error prone and programs get harder to debug because of the
 duplicate data set and addresses.
 
 Split address space also means that libraries cannot transparently use data
@@ -77,12 +77,12 @@ I/O bus, device memory characteristics
 
 I/O buses cripple shared address spaces due to a few limitations. Most I/O
 buses only allow basic memory access from device to main memory; even cache
-coherency is often optional. Access to device memory from CPU is even more
+coherency is often optional. Access to device memory from a CPU is even more
 limited. More often than not, it is not cache coherent.
 
 If we only consider the PCIE bus, then a device can access main memory (often
 through an IOMMU) and be cache coherent with the CPUs. However, it only allows
-a limited set of atomic operations from device on main memory. This is worse
+a limited set of atomic operations from the device on main memory. This is worse
 in the other direction: the CPU can only access a limited range of the device
 memory and cannot perform atomic operations on it. Thus device memory cannot
 be considered the same as regular memory from the kernel point of view.
@@ -93,20 +93,20 @@ The final limitation is latency. Access to main memory from the device has an
 order of magnitude higher latency than when the device accesses its own memory.
 
 Some platforms are developing new I/O buses or additions/modifications to PCIE
-to address some of these limitations (OpenCAPI, CCIX). They mainly allow two-
-way cache coherency between CPU and device and allow all atomic operations the
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow
+two-way cache coherency between CPU and device and allow all atomic operations the
 architecture supports. Sadly, not all platforms are following this trend and
 some major architectures are left without hardware solutions to these problems.
 
 So for shared address space to make sense, not only must we allow devices to
 access any memory but we must also permit any memory to be migrated to device
-memory while device is using it (blocking CPU access while it happens).
+memory while the device is using it (blocking CPU access while it happens).
 
 
 Shared address space and migration
 ==================================
 
-HMM intends to provide two main features. First one is to share the address
+HMM intends to provide two main features. The first one is to share the address
 space by duplicating the CPU page table in the device page table so the same
 address points to the same physical memory for any valid main memory address in
 the process address space.
@@ -121,14 +121,14 @@ why HMM provides helpers to factor out everything that can be while leaving the
 hardware specific details to the device driver.
 
 The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
-allows allocating a struct page for each page of the device memory. Those pages
+allows allocating a struct page for each page of device memory. Those pages
 are special because the CPU cannot map them. However, they allow migrating
 main memory to device memory using existing migration mechanisms and everything
-looks like a page is swapped out to disk from the CPU point of view. Using a
-struct page gives the easiest and cleanest integration with existing mm mech-
-anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
+looks like a page that is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm
+mechanisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
 memory for the device memory and second to perform migration. Policy decisions
-of what and when to migrate things is left to the device driver.
+of what and when to migrate is left to the device driver.
 
 Note that any CPU access to a device page triggers a page fault and a migration
 back to main memory. For example, when a page backing a given CPU address A is
@@ -136,8 +136,8 @@ migrated from a main memory page to a device page, then any CPU access to
 address A triggers a page fault and initiates a migration back to main memory.
 
 With these two features, HMM not only allows a device to mirror process address
-space and keeping both CPU and device page table synchronized, but also lever-
-ages device memory by migrating the part of the data set that is actively being
+space and keeps both CPU and device page tables synchronized, but also
+leverages device memory by migrating the part of the data set that is actively being
 used by the device.
 
 
@@ -151,21 +151,28 @@ registration of an hmm_mirror struct::
 
  int hmm_mirror_register(struct hmm_mirror *mirror,
                          struct mm_struct *mm);
- int hmm_mirror_register_locked(struct hmm_mirror *mirror,
-                                struct mm_struct *mm);
 
-
-The locked variant is to be used when the driver is already holding mmap_sem
-of the mm in write mode. The mirror struct has a set of callbacks that are used
+The mirror struct has a set of callbacks that are used
 to propagate CPU page tables::
 
  struct hmm_mirror_ops {
+     /* release() - release hmm_mirror
+      *
+      * @mirror: pointer to struct hmm_mirror
+      *
+      * This is called when the mm_struct is being released. The callback
+      * must ensure that all access to any pages obtained from this mirror
+      * is halted before the callback returns. All future access should
+      * fault.
+      */
+     void (*release)(struct hmm_mirror *mirror);
+
      /* sync_cpu_device_pagetables() - synchronize page tables
       *
       * @mirror: pointer to struct hmm_mirror
-      * @update_type: type of update that occurred to the CPU page table
-      * @start: virtual start address of the range to update
-      * @end: virtual end address of the range to update
+      * @update: update information (see struct mmu_notifier_range)
+      * Return: -EAGAIN if update.blockable false and callback need to
+      *         block, 0 otherwise.
       *
       * This callback ultimately originates from mmu_notifiers when the CPU
       * page table is updated. The device driver must update its page table
@@ -176,14 +183,12 @@ to propagate CPU page tables::
       * page tables are completely updated (TLBs flushed, etc); this is a
       * synchronous call.
       */
-     void (*update)(struct hmm_mirror *mirror,
-                    enum hmm_update action,
-                    unsigned long start,
-                    unsigned long end);
+     int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
+                                       const struct hmm_update *update);
  };
 
 The device driver must perform the update action to the range (mark range
-read only, or fully unmap, ...). The device must be done with the update before
+read only, or fully unmap, etc.). The device must complete the update before
 the driver callback returns.
 
 When the device driver wants to populate a range of virtual addresses, it can
@@ -194,17 +199,18 @@ use either::
 
 The first one (hmm_range_snapshot()) will only fetch present CPU page table
 entries and will not trigger a page fault on missing or non-present entries.
-The second one does trigger a page fault on missing or read-only entry if the
-write parameter is true. Page faults use the generic mm page fault code path
-just like a CPU page fault.
+The second one does trigger a page fault on missing or read-only entries if
+write access is requested (see below). Page faults use the generic mm page
+fault code path just like a CPU page fault.
 
 Both functions copy CPU page table entries into their pfns array argument. Each
 entry in that array corresponds to an address in the virtual range. HMM
 provides a set of flags to help the driver identify special CPU page table
 entries.
 
-Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronized. The usage pattern is::
+Locking within the sync_cpu_device_pagetables() callback is the most important
+aspect the driver must respect in order to keep things properly synchronized.
+The usage pattern is::
 
  int driver_populate_range(...)
  {
@@ -239,11 +245,11 @@ respect in order to keep things properly synchronized. The usage pattern is::
             hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
             goto again;
           }
-          hmm_mirror_unregister(&range);
+          hmm_range_unregister(&range);
           return ret;
       }
       take_lock(driver->update);
-      if (!range.valid) {
+      if (!hmm_range_valid(&range)) {
           release_lock(driver->update);
           up_read(&mm->mmap_sem);
           goto again;
@@ -251,15 +257,15 @@ respect in order to keep things properly synchronized. The usage pattern is::
 
       // Use pfns array content to update device page table
 
-      hmm_mirror_unregister(&range);
+      hmm_range_unregister(&range);
       release_lock(driver->update);
       up_read(&mm->mmap_sem);
       return 0;
  }
 
 The driver->update lock is the same lock that the driver takes inside its
-update() callback. That lock must be held before checking the range.valid
-field to avoid any race with a concurrent CPU page table update.
+sync_cpu_device_pagetables() callback. That lock must be held before calling
+hmm_range_valid() to avoid any race with a concurrent CPU page table update.
 
 HMM implements all this on top of the mmu_notifier API because we wanted a
 simpler API and also to be able to perform optimizations latter on like doing
@@ -279,46 +285,47 @@ concurrently).
 Leverage default_flags and pfn_flags_mask
 =========================================
 
-The hmm_range struct has 2 fields default_flags and pfn_flags_mask that allows
-to set fault or snapshot policy for a whole range instead of having to set them
-for each entries in the range.
+The hmm_range struct has 2 fields, default_flags and pfn_flags_mask, that specify
+fault or snapshot policy for the whole range instead of having to set them
+for each entry in the pfns array.
 
-For instance if the device flags for device entries are:
-    VALID (1 << 63)
-    WRITE (1 << 62)
+For instance, if the device flags for range.flags are::
 
-Now let say that device driver wants to fault with at least read a range then
-it does set::
+    range.flags[HMM_PFN_VALID] = (1 << 63);
+    range.flags[HMM_PFN_WRITE] = (1 << 62);
+
+and the device driver wants pages for a range with at least read permission,
+it sets::
 
     range->default_flags = (1 << 63);
     range->pfn_flags_mask = 0;
 
-and calls hmm_range_fault() as described above. This will fill fault all page
+and calls hmm_range_fault() as described above. This will fill fault all pages
 in the range with at least read permission.
 
-Now let say driver wants to do the same except for one page in the range for
-which its want to have write. Now driver set::
+Now let's say the driver wants to do the same except for one page in the range for
+which it wants to have write permission. Now driver set::
 
     range->default_flags = (1 << 63);
     range->pfn_flags_mask = (1 << 62);
     range->pfns[index_of_write] = (1 << 62);
 
-With this HMM will fault in all page with at least read (ie valid) and for the
+With this, HMM will fault in all pages with at least read (i.e., valid) and for the
 address == range->start + (index_of_write << PAGE_SHIFT) it will fault with
-write permission ie if the CPU pte does not have write permission set then HMM
+write permission i.e., if the CPU pte does not have write permission set then HMM
 will call handle_mm_fault().
 
-Note that HMM will populate the pfns array with write permission for any entry
-that have write permission within the CPU pte no matter what are the values set
+Note that HMM will populate the pfns array with write permission for any page
+that is mapped with CPU write permission no matter what values are set
 in default_flags or pfn_flags_mask.
 
 
 Represent and manage device memory from core kernel point of view
 =================================================================
 
-Several different designs were tried to support device memory. First one used
-a device specific data structure to keep information about migrated memory and
-HMM hooked itself in various places of mm code to handle any access to
+Several different designs were tried to support device memory. The first one
+used a device specific data structure to keep information about migrated memory
+and HMM hooked itself in various places of mm code to handle any access to
 addresses that were backed by device memory. It turns out that this ended up
 replicating most of the fields of struct page and also needed many kernel code
 paths to be updated to understand this new kind of memory.
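Note on the default_flags / pfn_flags_mask hunk above: the documented examples are consistent with each requested entry being combined as (pfns[i] & pfn_flags_mask) | default_flags before HMM decides whether to fault. The user-space sketch below only illustrates that arithmetic under that assumption. SKETCH_VALID, SKETCH_WRITE and sketch_effective_request() are hypothetical names, not part of the HMM API. The sketch uses 1ULL << 63 because shifting a plain int by 63 is undefined behaviour in C.

```c
#include <stdint.h>

/* Hypothetical device flag values mirroring the documentation example. */
#define SKETCH_VALID (1ULL << 63)
#define SKETCH_WRITE (1ULL << 62)

/*
 * Assumed combination rule: bits of the per-entry value survive only where
 * pfn_flags_mask allows them, then default_flags is OR-ed into every entry.
 */
static uint64_t sketch_effective_request(uint64_t entry,
					 uint64_t default_flags,
					 uint64_t pfn_flags_mask)
{
	return (entry & pfn_flags_mask) | default_flags;
}
```

With default_flags = SKETCH_VALID and pfn_flags_mask = 0, every entry comes out as a read-only request whatever it held before. Setting pfn_flags_mask = SKETCH_WRITE lets the one entry that carries SKETCH_WRITE ask for write permission as well, which matches the second example in the hunk.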
@@ -329,33 +336,6 @@ directly using struct page for device memory which left most kernel code paths
 unaware of the difference. We only need to make sure that no one ever tries to
 map those pages from the CPU side.
 
-HMM provides a set of helpers to register and hotplug device memory as a new
-region needing a struct page. This is offered through a very simple API::
-
- struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
-                                   struct device *device,
-                                   unsigned long size);
- void hmm_devmem_remove(struct hmm_devmem *devmem);
-
-The hmm_devmem_ops is where most of the important things are::
-
- struct hmm_devmem_ops {
-     void (*free)(struct hmm_devmem *devmem, struct page *page);
-     int (*fault)(struct hmm_devmem *devmem,
-                  struct vm_area_struct *vma,
-                  unsigned long addr,
-                  struct page *page,
-                  unsigned flags,
-                  pmd_t *pmdp);
- };
-
-The first callback (free()) happens when the last reference on a device page is
-dropped. This means the device page is now free and no longer used by anyone.
-The second callback happens whenever the CPU tries to access a device page
-which it cannot do. This second callback must trigger a migration back to
-system memory.
-
-
 Migration to and from device memory
 ===================================
 
@@ -417,9 +397,9 @@ willing to pay to keep all the code simpler.
 Memory cgroup (memcg) and rss accounting
 ========================================
 
-For now device memory is accounted as any regular page in rss counters (either
+For now, device memory is accounted as any regular page in rss counters (either
 anonymous if device page is used for anonymous, file if device page is used for
-file backed page or shmem if device page is used for shared memory). This is a
+file backed page, or shmem if device page is used for shared memory). This is a
 deliberate choice to keep existing applications, that might start using device
 memory without knowing about it, running unimpacted.
 
@@ -439,6 +419,6 @@ get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pinned by device driver nor through GUP
+Note that device memory can never be pinned by a device driver nor through GUP
 and thus such memory is always free upon process exit. Or when last reference
 is dropped in case of shared memory or file backed memory.

@@ -131,17 +131,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page;
+	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
 	int ret;
 
-	/*
-	 * If we have an altmap then we need to skip over any reserved PFNs
-	 * when querying the zone.
-	 */
-	page = pfn_to_page(start_pfn);
-	if (altmap)
-		page += vmem_altmap_offset(altmap);
-
 	__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
 
 	/* Remove htab bolted mappings for this section of memory */

@@ -1213,13 +1213,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn);
-	struct zone *zone;
+	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
+	struct zone *zone = page_zone(page);
 
-	/* With altmap the first mapped page is offset from @start */
-	if (altmap)
-		page += vmem_altmap_offset(altmap);
-	zone = page_zone(page);
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }

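Note on the two arch_remove_memory() hunks above: both drop the explicit if (altmap) check and add vmem_altmap_offset(altmap) unconditionally, which is only safe if the helper tolerates a NULL altmap and returns 0 for it. The following is a minimal user-space sketch of such a NULL-tolerant helper, assuming the offset is the sum of the reserved and free pfn counts. struct sketch_altmap and sketch_altmap_offset() are simplified, hypothetical stand-ins, not the kernel definitions.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct vmem_altmap (two fields only). */
struct sketch_altmap {
	unsigned long reserve;	/* pfns reserved at the start of the range */
	unsigned long free;	/* pfns set aside for the memmap itself */
};

/* Returns 0 for a NULL altmap, so callers can add the result unconditionally. */
static unsigned long sketch_altmap_offset(const struct sketch_altmap *altmap)
{
	if (altmap)
		return altmap->reserve + altmap->free;
	return 0;
}
```

Pushing the NULL check into the helper is what lets the callers initialise page in a single declaration instead of a declaration plus a conditional adjustment.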
@@ -43,8 +43,6 @@ struct dax_region {
  * @target_node: effective numa node if dev_dax memory range is onlined
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
- * @ref: pgmap reference count (driver owned)
- * @cmp: @ref final put completion (driver owned)
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -52,8 +50,6 @@ struct dev_dax {
 	int target_node;
 	struct device dev;
 	struct dev_pagemap pgmap;
-	struct percpu_ref ref;
-	struct completion cmp;
 };
 
 static inline struct dev_dax *to_dev_dax(struct device *dev)

@@ -14,37 +14,6 @@
 #include "dax-private.h"
 #include "bus.h"
 
-static struct dev_dax *ref_to_dev_dax(struct percpu_ref *ref)
-{
-	return container_of(ref, struct dev_dax, ref);
-}
-
-static void dev_dax_percpu_release(struct percpu_ref *ref)
-{
-	struct dev_dax *dev_dax = ref_to_dev_dax(ref);
-
-	dev_dbg(&dev_dax->dev, "%s\n", __func__);
-	complete(&dev_dax->cmp);
-}
-
-static void dev_dax_percpu_exit(struct percpu_ref *ref)
-{
-	struct dev_dax *dev_dax = ref_to_dev_dax(ref);
-
-	dev_dbg(&dev_dax->dev, "%s\n", __func__);
-	wait_for_completion(&dev_dax->cmp);
-	percpu_ref_exit(ref);
-}
-
-static void dev_dax_percpu_kill(struct percpu_ref *data)
-{
-	struct percpu_ref *ref = data;
-	struct dev_dax *dev_dax = ref_to_dev_dax(ref);
-
-	dev_dbg(&dev_dax->dev, "%s\n", __func__);
-	percpu_ref_kill(ref);
-}
-
 static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		const char *func)
 {
@@ -459,15 +428,7 @@ int dev_dax_probe(struct device *dev)
 		return -EBUSY;
 	}
 
-	init_completion(&dev_dax->cmp);
-	rc = percpu_ref_init(&dev_dax->ref, dev_dax_percpu_release, 0,
-			GFP_KERNEL);
-	if (rc)
-		return rc;
-
-	dev_dax->pgmap.ref = &dev_dax->ref;
-	dev_dax->pgmap.kill = dev_dax_percpu_kill;
-	dev_dax->pgmap.cleanup = dev_dax_percpu_exit;
+	dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
 	addr = devm_memremap_pages(dev, &dev_dax->pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);

@@ -16,7 +16,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
 	struct dev_dax *dev_dax;
 	struct nd_namespace_io *nsio;
 	struct dax_region *dax_region;
-	struct dev_pagemap pgmap = { 0 };
+	struct dev_pagemap pgmap = { };
 	struct nd_namespace_common *ndns;
 	struct nd_dax *nd_dax = to_nd_dax(dev);
 	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;

@@ -84,11 +84,11 @@ config DRM_NOUVEAU_BACKLIGHT
 
 config DRM_NOUVEAU_SVM
 	bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
-	depends on ARCH_HAS_HMM
+	depends on DEVICE_PRIVATE
 	depends on DRM_NOUVEAU
+	depends on HMM_MIRROR
 	depends on STAGING
-	select HMM_MIRROR
-	select DEVICE_PRIVATE
+	select MIGRATE_VMA_HELPER
 	default n
 	help
 	  Say Y here if you want to enable experimental support for

@@ -72,7 +72,8 @@ struct nouveau_dmem_migrate {
 };
 
 struct nouveau_dmem {
-	struct hmm_devmem *devmem;
+	struct nouveau_drm *drm;
+	struct dev_pagemap pagemap;
 	struct nouveau_dmem_migrate migrate;
 	struct list_head chunk_free;
 	struct list_head chunk_full;
@@ -80,6 +81,11 @@ struct nouveau_dmem {
 	struct mutex mutex;
 };
 
+static inline struct nouveau_dmem *page_to_dmem(struct page *page)
+{
+	return container_of(page->pgmap, struct nouveau_dmem, pagemap);
+}
+
 struct nouveau_dmem_fault {
 	struct nouveau_drm *drm;
 	struct nouveau_fence *fence;
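Note on page_to_dmem() above: because struct dev_pagemap is now embedded in struct nouveau_dmem, the driver can get from page->pgmap back to its own structure with container_of() instead of keeping a separate back pointer. The following is a minimal user-space sketch of that pattern. All sketch_* names are hypothetical, and the macro is a simplified version without the kernel's type checking.

```c
#include <stddef.h>

/* User-space equivalent of the kernel's container_of() (no type checking). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct dev_pagemap; only its address matters here. */
struct sketch_pagemap {
	int type;
};

/* Stand-in for struct nouveau_dmem with the pagemap embedded by value. */
struct sketch_dmem {
	void *drm;
	struct sketch_pagemap pagemap;
};

/* Recover the enclosing object from a pointer to its embedded member. */
static struct sketch_dmem *sketch_pagemap_to_dmem(struct sketch_pagemap *pgmap)
{
	return sketch_container_of(pgmap, struct sketch_dmem, pagemap);
}
```

The lookup is pure pointer arithmetic, so it only works when the member is embedded by value; a pointer member would need a different mechanism.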
@@ -96,14 +102,10 @@ struct nouveau_migrate {
 	unsigned long dma_nr;
 };
 
-static void
-nouveau_dmem_free(struct hmm_devmem *devmem, struct page *page)
+static void nouveau_dmem_page_free(struct page *page)
 {
-	struct nouveau_dmem_chunk *chunk;
-	unsigned long idx;
-
-	chunk = (void *)hmm_devmem_page_get_drvdata(page);
-	idx = page_to_pfn(page) - chunk->pfn_first;
+	struct nouveau_dmem_chunk *chunk = page->zone_device_data;
+	unsigned long idx = page_to_pfn(page) - chunk->pfn_first;
 
 	/*
 	 * FIXME:
@@ -148,11 +150,12 @@ nouveau_dmem_fault_alloc_and_copy(struct vm_area_struct *vma,
 		if (!spage || !(src_pfns[i] & MIGRATE_PFN_MIGRATE))
 			continue;
 
-		dpage = hmm_vma_alloc_locked_page(vma, addr);
+		dpage = alloc_page_vma(GFP_HIGHUSER, vma, addr);
 		if (!dpage) {
 			dst_pfns[i] = MIGRATE_PFN_ERROR;
 			continue;
 		}
+		lock_page(dpage);
 
 		dst_pfns[i] = migrate_pfn(page_to_pfn(dpage)) |
 			      MIGRATE_PFN_LOCKED;
@@ -194,7 +197,7 @@ nouveau_dmem_fault_alloc_and_copy(struct vm_area_struct *vma,
 
 		dst_addr = fault->dma[fault->npages++];
 
-		chunk = (void *)hmm_devmem_page_get_drvdata(spage);
+		chunk = spage->zone_device_data;
 		src_addr = page_to_pfn(spage) - chunk->pfn_first;
 		src_addr = (src_addr << PAGE_SHIFT) + chunk->bo->bo.offset;
 
@@ -259,29 +262,21 @@ static const struct migrate_vma_ops nouveau_dmem_fault_migrate_ops = {
 	.finalize_and_map = nouveau_dmem_fault_finalize_and_map,
 };
 
-static vm_fault_t
-nouveau_dmem_fault(struct hmm_devmem *devmem,
-		   struct vm_area_struct *vma,
-		   unsigned long addr,
-		   const struct page *page,
-		   unsigned int flags,
-		   pmd_t *pmdp)
+static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 {
-	struct drm_device *drm_dev = dev_get_drvdata(devmem->device);
+	struct nouveau_dmem *dmem = page_to_dmem(vmf->page);
 	unsigned long src[1] = {0}, dst[1] = {0};
-	struct nouveau_dmem_fault fault = {0};
+	struct nouveau_dmem_fault fault = { .drm = dmem->drm };
 	int ret;
 
-
-
 	/*
 	 * FIXME what we really want is to find some heuristic to migrate more
 	 * than just one page on CPU fault. When such fault happens it is very
 	 * likely that more surrounding page will CPU fault too.
 	 */
-	fault.drm = nouveau_drm(drm_dev);
-	ret = migrate_vma(&nouveau_dmem_fault_migrate_ops, vma, addr,
-			  addr + PAGE_SIZE, src, dst, &fault);
+	ret = migrate_vma(&nouveau_dmem_fault_migrate_ops, vmf->vma,
+			  vmf->address, vmf->address + PAGE_SIZE,
+			  src, dst, &fault);
 	if (ret)
 		return VM_FAULT_SIGBUS;
 
@@ -291,10 +286,9 @@ nouveau_dmem_fault(struct hmm_devmem *devmem,
 	return 0;
 }
 
-static const struct hmm_devmem_ops
-nouveau_dmem_devmem_ops = {
-	.free = nouveau_dmem_free,
-	.fault = nouveau_dmem_fault,
+static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
+	.page_free = nouveau_dmem_page_free,
+	.migrate_to_ram = nouveau_dmem_migrate_to_ram,
 };
 
 static int
@@ -580,7 +574,8 @@ void
|
|||
nouveau_dmem_init(struct nouveau_drm *drm)
|
||||
{
|
||||
struct device *device = drm->dev->dev;
|
||||
unsigned long i, size;
|
||||
struct resource *res;
|
||||
unsigned long i, size, pfn_first;
|
||||
int ret;
|
||||
|
||||
/* This only make sense on PASCAL or newer */
|
||||
|
@@ -590,6 +585,7 @@ nouveau_dmem_init(struct nouveau_drm *drm)
|
|||
if (!(drm->dmem = kzalloc(sizeof(*drm->dmem), GFP_KERNEL)))
|
||||
return;
|
||||
|
||||
drm->dmem->drm = drm;
|
||||
mutex_init(&drm->dmem->mutex);
|
||||
INIT_LIST_HEAD(&drm->dmem->chunk_free);
|
||||
INIT_LIST_HEAD(&drm->dmem->chunk_full);
|
||||
|
@@ -599,11 +595,8 @@ nouveau_dmem_init(struct nouveau_drm *drm)
|
|||
|
||||
/* Initialize migration dma helpers before registering memory */
|
||||
ret = nouveau_dmem_migrate_init(drm);
|
||||
if (ret) {
|
||||
kfree(drm->dmem);
|
||||
drm->dmem = NULL;
|
||||
return;
|
||||
}
|
||||
if (ret)
|
||||
goto out_free;
|
||||
|
||||
/*
|
||||
* FIXME we need some kind of policy to decide how much VRAM we
|
||||
|
@@ -611,14 +604,16 @@ nouveau_dmem_init(struct nouveau_drm *drm)
|
|||
* and latter if we want to do thing like over commit then we
|
||||
* could revisit this.
|
||||
*/
|
||||
drm->dmem->devmem = hmm_devmem_add(&nouveau_dmem_devmem_ops,
|
||||
device, size);
|
||||
if (IS_ERR(drm->dmem->devmem)) {
|
||||
kfree(drm->dmem);
|
||||
drm->dmem = NULL;
|
||||
return;
|
||||
}
|
||||
res = devm_request_free_mem_region(device, &iomem_resource, size);
|
||||
if (IS_ERR(res))
|
||||
goto out_free;
|
||||
drm->dmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
|
||||
drm->dmem->pagemap.res = *res;
|
||||
drm->dmem->pagemap.ops = &nouveau_dmem_pagemap_ops;
|
||||
if (IS_ERR(devm_memremap_pages(device, &drm->dmem->pagemap)))
|
||||
goto out_free;
|
||||
|
||||
pfn_first = res->start >> PAGE_SHIFT;
|
||||
for (i = 0; i < (size / DMEM_CHUNK_SIZE); ++i) {
|
||||
struct nouveau_dmem_chunk *chunk;
|
||||
struct page *page;
|
||||
|
@@ -631,17 +626,19 @@ nouveau_dmem_init(struct nouveau_drm *drm)
|
|||
}
|
||||
|
||||
chunk->drm = drm;
|
||||
chunk->pfn_first = drm->dmem->devmem->pfn_first;
|
||||
chunk->pfn_first += (i * DMEM_CHUNK_NPAGES);
|
||||
chunk->pfn_first = pfn_first + (i * DMEM_CHUNK_NPAGES);
|
||||
list_add_tail(&chunk->list, &drm->dmem->chunk_empty);
|
||||
|
||||
page = pfn_to_page(chunk->pfn_first);
|
||||
for (j = 0; j < DMEM_CHUNK_NPAGES; ++j, ++page) {
|
||||
hmm_devmem_page_set_drvdata(page, (long)chunk);
|
||||
}
|
||||
for (j = 0; j < DMEM_CHUNK_NPAGES; ++j, ++page)
|
||||
page->zone_device_data = chunk;
|
||||
}
|
||||
|
||||
NV_INFO(drm, "DMEM: registered %ldMB of device memory\n", size >> 20);
|
||||
return;
|
||||
out_free:
|
||||
kfree(drm->dmem);
|
||||
drm->dmem = NULL;
|
||||
}
|
||||
|
||||
static void
|
||||
|
@@ -697,7 +694,7 @@ nouveau_dmem_migrate_alloc_and_copy(struct vm_area_struct *vma,
|
|||
if (!dpage || dst_pfns[i] == MIGRATE_PFN_ERROR)
|
||||
continue;
|
||||
|
||||
chunk = (void *)hmm_devmem_page_get_drvdata(dpage);
|
||||
chunk = dpage->zone_device_data;
|
||||
dst_addr = page_to_pfn(dpage) - chunk->pfn_first;
|
||||
dst_addr = (dst_addr << PAGE_SHIFT) + chunk->bo->bo.offset;
|
||||
|
||||
|
@@ -832,13 +829,7 @@ out:
|
|||
static inline bool
|
||||
nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
|
||||
{
|
||||
if (!is_device_private_page(page))
|
||||
return false;
|
||||
|
||||
if (drm->dmem->devmem != page->pgmap->data)
|
||||
return false;
|
||||
|
||||
return true;
|
||||
return is_device_private_page(page) && drm->dmem == page_to_dmem(page);
|
||||
}
|
||||
|
||||
void
|
||||
|
@@ -867,7 +858,7 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
|
|||
continue;
|
||||
}
|
||||
|
||||
chunk = (void *)hmm_devmem_page_get_drvdata(page);
|
||||
chunk = page->zone_device_data;
|
||||
addr = page_to_pfn(page) - chunk->pfn_first;
|
||||
addr = (addr + chunk->bo->bo.mem.start) << PAGE_SHIFT;
|
||||
|
||||
|
|
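The address arithmetic in the nouveau_dmem hunks above (page to chunk via `zone_device_data`, then the pfn offset shifted by `PAGE_SHIFT` and added to the buffer-object offset) can be sketched in user space as follows. This is only an illustration: `mock_chunk`, `mock_page`, `MOCK_PAGE_SHIFT`, and `chunk_vram_addr` are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct nouveau_dmem_chunk. */
struct mock_chunk {
	uint64_t pfn_first;	/* first pfn covered by the chunk */
	uint64_t bo_offset;	/* VRAM offset of the backing buffer object */
};

/* Hypothetical stand-in for struct page with its zone_device_data slot. */
struct mock_page {
	uint64_t pfn;
	void *zone_device_data;	/* driver-private pointer, here the chunk */
};

/* Same two steps as the diff: index within the chunk, then byte address. */
static uint64_t chunk_vram_addr(const struct mock_page *page)
{
	const struct mock_chunk *chunk = page->zone_device_data;
	uint64_t addr;

	assert(chunk && page->pfn >= chunk->pfn_first);
	addr = page->pfn - chunk->pfn_first;
	return (addr << MOCK_PAGE_SHIFT) + chunk->bo_offset;
}
```

Storing the chunk pointer directly in the page avoids the cast through `unsigned long` that the removed `hmm_devmem_page_{get,set}_drvdata()` helpers required.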
|
@@ -649,7 +649,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		range.values = nouveau_svm_pfn_values;
 		range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
 again:
-		ret = hmm_vma_fault(&range, true);
+		ret = hmm_vma_fault(&svmm->mirror, &range, true);
 		if (ret == 0) {
 			mutex_lock(&svmm->mutex);
 			if (!hmm_vma_range_done(&range)) {
|
|
|
@@ -622,7 +622,6 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 		if (offset < reserve)
 			return -EINVAL;
 		nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
-		pgmap->altmap_valid = false;
 	} else if (nd_pfn->mode == PFN_MODE_PMEM) {
 		nd_pfn->npfns = PFN_SECTION_ALIGN_UP((resource_size(res)
 					- offset) / PAGE_SIZE);
@@ -634,7 +633,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 		memcpy(altmap, &__altmap, sizeof(*altmap));
 		altmap->free = PHYS_PFN(offset - reserve);
 		altmap->alloc = 0;
-		pgmap->altmap_valid = true;
+		pgmap->flags |= PGMAP_ALTMAP_VALID;
 	} else
 		return -ENXIO;
 
|
|
|
@@ -303,24 +303,24 @@ static const struct attribute_group *pmem_attribute_groups[] = {
 	NULL,
 };
 
-static void __pmem_release_queue(struct percpu_ref *ref)
+static void pmem_pagemap_cleanup(struct dev_pagemap *pgmap)
 {
-	struct request_queue *q;
+	struct request_queue *q =
+		container_of(pgmap->ref, struct request_queue, q_usage_counter);
 
-	q = container_of(ref, typeof(*q), q_usage_counter);
 	blk_cleanup_queue(q);
 }
 
-static void pmem_release_queue(void *ref)
+static void pmem_release_queue(void *pgmap)
 {
-	__pmem_release_queue(ref);
+	pmem_pagemap_cleanup(pgmap);
 }
 
-static void pmem_freeze_queue(struct percpu_ref *ref)
+static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
 {
-	struct request_queue *q;
+	struct request_queue *q =
+		container_of(pgmap->ref, struct request_queue, q_usage_counter);
 
-	q = container_of(ref, typeof(*q), q_usage_counter);
 	blk_freeze_queue_start(q);
 }
 
|
@@ -334,26 +334,16 @@ static void pmem_release_disk(void *__pmem)
|
|||
put_disk(pmem->disk);
|
||||
}
|
||||
|
||||
static void pmem_release_pgmap_ops(void *__pgmap)
|
||||
{
|
||||
dev_pagemap_put_ops();
|
||||
}
|
||||
|
||||
static void fsdax_pagefree(struct page *page, void *data)
|
||||
static void pmem_pagemap_page_free(struct page *page)
|
||||
{
|
||||
wake_up_var(&page->_refcount);
|
||||
}
|
||||
|
||||
static int setup_pagemap_fsdax(struct device *dev, struct dev_pagemap *pgmap)
|
||||
{
|
||||
dev_pagemap_get_ops();
|
||||
if (devm_add_action_or_reset(dev, pmem_release_pgmap_ops, pgmap))
|
||||
return -ENOMEM;
|
||||
pgmap->type = MEMORY_DEVICE_FS_DAX;
|
||||
pgmap->page_free = fsdax_pagefree;
|
||||
|
||||
return 0;
|
||||
}
|
||||
static const struct dev_pagemap_ops fsdax_pagemap_ops = {
|
||||
.page_free = pmem_pagemap_page_free,
|
||||
.kill = pmem_pagemap_kill,
|
||||
.cleanup = pmem_pagemap_cleanup,
|
||||
};
|
||||
|
||||
static int pmem_attach_disk(struct device *dev,
|
||||
struct nd_namespace_common *ndns)
|
||||
|
@@ -409,11 +399,9 @@ static int pmem_attach_disk(struct device *dev,
|
|||
|
||||
pmem->pfn_flags = PFN_DEV;
|
||||
pmem->pgmap.ref = &q->q_usage_counter;
|
||||
pmem->pgmap.kill = pmem_freeze_queue;
|
||||
pmem->pgmap.cleanup = __pmem_release_queue;
|
||||
if (is_nd_pfn(dev)) {
|
||||
if (setup_pagemap_fsdax(dev, &pmem->pgmap))
|
||||
return -ENOMEM;
|
||||
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
|
||||
pmem->pgmap.ops = &fsdax_pagemap_ops;
|
||||
addr = devm_memremap_pages(dev, &pmem->pgmap);
|
||||
pfn_sb = nd_pfn->pfn_sb;
|
||||
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
|
||||
|
@@ -424,15 +412,14 @@ static int pmem_attach_disk(struct device *dev,
|
|||
bb_res.start += pmem->data_offset;
|
||||
} else if (pmem_should_map_pages(dev)) {
|
||||
memcpy(&pmem->pgmap.res, &nsio->res, sizeof(pmem->pgmap.res));
|
||||
pmem->pgmap.altmap_valid = false;
|
||||
if (setup_pagemap_fsdax(dev, &pmem->pgmap))
|
||||
return -ENOMEM;
|
||||
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
|
||||
pmem->pgmap.ops = &fsdax_pagemap_ops;
|
||||
addr = devm_memremap_pages(dev, &pmem->pgmap);
|
||||
pmem->pfn_flags |= PFN_MAP;
|
||||
memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
|
||||
} else {
|
||||
if (devm_add_action_or_reset(dev, pmem_release_queue,
|
||||
&q->q_usage_counter))
|
||||
&pmem->pgmap))
|
||||
return -ENOMEM;
|
||||
addr = devm_memremap(dev, pmem->phys_addr,
|
||||
pmem->size, ARCH_MEMREMAP_PMEM);
|
||||
|
|
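In the pmem hunks above, the new `kill` and `cleanup` callbacks receive the `struct dev_pagemap` and recover the request queue from `pgmap->ref` with `container_of()`. The user-space sketch below illustrates that pointer arithmetic with mock types. `mock_ref`, `mock_queue`, `mock_pagemap`, and `queue_from_pagemap` are hypothetical stand-ins, not kernel definitions, and the macro is a simplified re-statement of the kernel's `container_of()`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space version of the container_of() idea. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct percpu_ref. */
struct mock_ref {
	long count;
};

/* Hypothetical stand-in for struct request_queue. */
struct mock_queue {
	int id;
	struct mock_ref q_usage_counter;	/* embedded counter */
};

/* Hypothetical stand-in for struct dev_pagemap. */
struct mock_pagemap {
	struct mock_ref *ref;			/* points at the embedded counter */
};

/* Same shape as pmem_pagemap_kill()/cleanup(): pgmap -> ref -> queue. */
static struct mock_queue *queue_from_pagemap(struct mock_pagemap *pgmap)
{
	assert(pgmap && pgmap->ref);
	return container_of(pgmap->ref, struct mock_queue, q_usage_counter);
}
```

This only works because `pgmap->ref` points at a counter that is embedded in the queue; a separately allocated counter would make the recovered pointer meaningless.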
|
@@ -25,12 +25,6 @@ struct pci_p2pdma {
|
|||
bool p2pmem_published;
|
||||
};
|
||||
|
||||
struct p2pdma_pagemap {
|
||||
struct dev_pagemap pgmap;
|
||||
struct percpu_ref ref;
|
||||
struct completion ref_done;
|
||||
};
|
||||
|
||||
static ssize_t size_show(struct device *dev, struct device_attribute *attr,
|
||||
char *buf)
|
||||
{
|
||||
|
@@ -79,31 +73,6 @@ static const struct attribute_group p2pmem_group = {
|
|||
.name = "p2pmem",
|
||||
};
|
||||
|
||||
static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref)
|
||||
{
|
||||
return container_of(ref, struct p2pdma_pagemap, ref);
|
||||
}
|
||||
|
||||
static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
|
||||
{
|
||||
struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
|
||||
|
||||
complete(&p2p_pgmap->ref_done);
|
||||
}
|
||||
|
||||
static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
|
||||
{
|
||||
percpu_ref_kill(ref);
|
||||
}
|
||||
|
||||
static void pci_p2pdma_percpu_cleanup(struct percpu_ref *ref)
|
||||
{
|
||||
struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
|
||||
|
||||
wait_for_completion(&p2p_pgmap->ref_done);
|
||||
percpu_ref_exit(&p2p_pgmap->ref);
|
||||
}
|
||||
|
||||
static void pci_p2pdma_release(void *data)
|
||||
{
|
||||
struct pci_dev *pdev = data;
|
||||
|
@@ -166,7 +135,6 @@ out:
|
|||
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
|
||||
u64 offset)
|
||||
{
|
||||
struct p2pdma_pagemap *p2p_pgmap;
|
||||
struct dev_pagemap *pgmap;
|
||||
void *addr;
|
||||
int error;
|
||||
|
@@ -189,27 +157,15 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
|
|||
return error;
|
||||
}
|
||||
|
||||
p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
|
||||
if (!p2p_pgmap)
|
||||
pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
|
||||
if (!pgmap)
|
||||
return -ENOMEM;
|
||||
|
||||
init_completion(&p2p_pgmap->ref_done);
|
||||
error = percpu_ref_init(&p2p_pgmap->ref,
|
||||
pci_p2pdma_percpu_release, 0, GFP_KERNEL);
|
||||
if (error)
|
||||
goto pgmap_free;
|
||||
|
||||
pgmap = &p2p_pgmap->pgmap;
|
||||
|
||||
pgmap->res.start = pci_resource_start(pdev, bar) + offset;
|
||||
pgmap->res.end = pgmap->res.start + size - 1;
|
||||
pgmap->res.flags = pci_resource_flags(pdev, bar);
|
||||
pgmap->ref = &p2p_pgmap->ref;
|
||||
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
|
||||
pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
|
||||
pci_resource_start(pdev, bar);
|
||||
pgmap->kill = pci_p2pdma_percpu_kill;
|
||||
pgmap->cleanup = pci_p2pdma_percpu_cleanup;
|
||||
|
||||
addr = devm_memremap_pages(&pdev->dev, pgmap);
|
||||
if (IS_ERR(addr)) {
|
||||
|
@@ -220,7 +176,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
|
|||
error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
|
||||
pci_bus_address(pdev, bar) + offset,
|
||||
resource_size(&pgmap->res), dev_to_node(&pdev->dev),
|
||||
&p2p_pgmap->ref);
|
||||
pgmap->ref);
|
||||
if (error)
|
||||
goto pages_free;
|
||||
|
||||
|
@@ -232,7 +188,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
|
|||
pages_free:
|
||||
devm_memunmap_pages(&pdev->dev, pgmap);
|
||||
pgmap_free:
|
||||
devm_kfree(&pdev->dev, p2p_pgmap);
|
||||
devm_kfree(&pdev->dev, pgmap);
|
||||
return error;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
|
||||
|
|
|
@@ -1322,7 +1322,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
-		page = _vm_normal_page(vma, addr, pte, true);
+		page = vm_normal_page(vma, addr, pte);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
|
|
|
@@ -21,8 +21,8 @@
  *
  * HMM address space mirroring API:
  *
- * Use HMM address space mirroring if you want to mirror range of the CPU page
- * table of a process into a device page table. Here, "mirror" means "keep
+ * Use HMM address space mirroring if you want to mirror a range of the CPU
+ * page tables of a process into a device page table. Here, "mirror" means "keep
  * synchronized". Prerequisites: the device must provide the ability to write-
  * protect its page tables (at PAGE_SIZE granularity), and must be able to
  * recover from the resulting potential page faults.
|
@@ -62,7 +62,7 @@
 #include <linux/kconfig.h>
 #include <asm/pgtable.h>
 
-#if IS_ENABLED(CONFIG_HMM)
+#ifdef CONFIG_HMM_MIRROR
 
 #include <linux/device.h>
 #include <linux/migrate.h>
|
@@ -82,19 +82,18 @@
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  * @wq: wait queue for user waiting on a range invalidation
  * @notifiers: count of active mmu notifiers
- * @dead: is the mm dead ?
  */
 struct hmm {
 	struct mm_struct *mm;
 	struct kref kref;
-	struct mutex lock;
+	spinlock_t ranges_lock;
 	struct list_head ranges;
 	struct list_head mirrors;
 	struct mmu_notifier mmu_notifier;
 	struct rw_semaphore mirrors_sem;
 	wait_queue_head_t wq;
+	struct rcu_head rcu;
 	long notifiers;
-	bool dead;
 };
 
 /*
|
@@ -105,10 +104,11 @@ struct hmm {
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
  *
- * The driver provide a flags array, if driver valid bit for an entry is bit
- * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
+ * The driver provides a flags array for mapping page protections to device
+ * PTE bits. If the driver valid bit for an entry is bit 3,
+ * i.e., (entry & (1 << 3)), then the driver must provide
  * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
- * Same logic apply to all flags. This is same idea as vm_page_prot in vma
+ * Same logic apply to all flags. This is the same idea as vm_page_prot in vma
  * except that this is per device driver rather than per architecture.
  */
 enum hmm_pfn_flag_e {
|
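The flags-array convention described in the comment above can be illustrated with a small user-space sketch. The enum and struct below are simplified stand-ins (prefixed `MOCK_`/`mock_`), and the bit positions are hypothetical: bit 3 for "valid" follows the comment's own example, while bit 4 for "write" is an arbitrary choice, not any real driver's layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for enum hmm_pfn_flag_e: indexes into flags[]. */
enum mock_pfn_flag {
	MOCK_PFN_VALID = 0,
	MOCK_PFN_WRITE,
	MOCK_PFN_FLAG_MAX
};

/* Simplified stand-in for the flags pointer in struct hmm_range. */
struct mock_range {
	const uint64_t *flags;	/* driver-provided, one device bit per flag */
};

/* Example driver table: valid is bit 3 (as in the comment), write is bit 4. */
static const uint64_t mock_driver_flags[MOCK_PFN_FLAG_MAX] = {
	[MOCK_PFN_VALID] = 1ULL << 3,
	[MOCK_PFN_WRITE] = 1ULL << 4,
};

/* Test a device entry against the driver's own bit for the given flag. */
static bool mock_entry_has(const struct mock_range *range, uint64_t entry,
			   enum mock_pfn_flag flag)
{
	assert(range && range->flags && flag < MOCK_PFN_FLAG_MAX);
	return (entry & range->flags[flag]) != 0;
}
```

Because the core code only ever consults the table, each driver can keep its native PTE bit layout, which is the "per device driver rather than per architecture" point the comment makes.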
@@ -129,13 +129,13 @@ enum hmm_pfn_flag_e {
|
|||
* be mirrored by a device, because the entry will never have HMM_PFN_VALID
|
||||
* set and the pfn value is undefined.
|
||||
*
|
||||
* Driver provide entry value for none entry, error entry and special entry,
|
||||
* driver can alias (ie use same value for error and special for instance). It
|
||||
* should not alias none and error or special.
|
||||
* Driver provides values for none entry, error entry, and special entry.
|
||||
* Driver can alias (i.e., use same value) error and special, but
|
||||
* it should not alias none with error or special.
|
||||
*
|
||||
* HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
|
||||
* hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
|
||||
* hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
|
||||
* hmm_range.values[HMM_PFN_NONE] if there is no CPU page table entry,
|
||||
* hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
|
||||
*/
|
||||
enum hmm_pfn_value_e {
|
||||
|
@@ -158,6 +158,7 @@ enum hmm_pfn_value_e {
|
|||
* @values: pfn value for some special case (none, special, error, ...)
|
||||
* @default_flags: default flags for the range (write, read, ... see hmm doc)
|
||||
* @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
|
||||
* @page_shift: device virtual address shift value (should be >= PAGE_SHIFT)
|
||||
* @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
|
||||
* @valid: pfns array did not change since it has been fill by an HMM function
|
||||
*/
|
||||
|
@@ -180,7 +181,7 @@ struct hmm_range {
|
|||
/*
|
||||
* hmm_range_page_shift() - return the page shift for the range
|
||||
* @range: range being queried
|
||||
* Returns: page shift (page size = 1 << page shift) for the range
|
||||
* Return: page shift (page size = 1 << page shift) for the range
|
||||
*/
|
||||
static inline unsigned hmm_range_page_shift(const struct hmm_range *range)
|
||||
{
|
||||
|
@@ -190,7 +191,7 @@ static inline unsigned hmm_range_page_shift(const struct hmm_range *range)
|
|||
/*
|
||||
* hmm_range_page_size() - return the page size for the range
|
||||
* @range: range being queried
|
||||
* Returns: page size for the range in bytes
|
||||
* Return: page size for the range in bytes
|
||||
*/
|
||||
static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
|
||||
{
|
||||
|
@@ -201,28 +202,19 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
  * hmm_range_wait_until_valid() - wait for range to be valid
  * @range: range affected by invalidation to wait on
  * @timeout: time out for wait in ms (ie abort wait after that period of time)
- * Returns: true if the range is valid, false otherwise.
+ * Return: true if the range is valid, false otherwise.
  */
 static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
 					      unsigned long timeout)
 {
-	/* Check if mm is dead ? */
-	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
-		range->valid = false;
-		return false;
-	}
-	if (range->valid)
-		return true;
-	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
-			   msecs_to_jiffies(timeout));
-	/* Return current valid status just in case we get lucky */
-	return range->valid;
+	return wait_event_timeout(range->hmm->wq, range->valid,
+				  msecs_to_jiffies(timeout)) != 0;
 }
 
 /*
  * hmm_range_valid() - test if a range is valid or not
  * @range: range
- * Returns: true if the range is valid, false otherwise.
+ * Return: true if the range is valid, false otherwise.
  */
 static inline bool hmm_range_valid(struct hmm_range *range)
 {
|
@@ -233,7 +225,7 @@ static inline bool hmm_range_valid(struct hmm_range *range)
|
|||
* hmm_device_entry_to_page() - return struct page pointed to by a device entry
|
||||
* @range: range use to decode device entry value
|
||||
* @entry: device entry value to get corresponding struct page from
|
||||
* Returns: struct page pointer if entry is a valid, NULL otherwise
|
||||
* Return: struct page pointer if entry is a valid, NULL otherwise
|
||||
*
|
||||
* If the device entry is valid (ie valid flag set) then return the struct page
|
||||
* matching the entry value. Otherwise return NULL.
|
||||
|
@@ -256,7 +248,7 @@ static inline struct page *hmm_device_entry_to_page(const struct hmm_range *rang
|
|||
* hmm_device_entry_to_pfn() - return pfn value store in a device entry
|
||||
* @range: range use to decode device entry value
|
||||
* @entry: device entry to extract pfn from
|
||||
* Returns: pfn value if device entry is valid, -1UL otherwise
|
||||
* Return: pfn value if device entry is valid, -1UL otherwise
|
||||
*/
|
||||
static inline unsigned long
|
||||
hmm_device_entry_to_pfn(const struct hmm_range *range, uint64_t pfn)
|
||||
|
@@ -276,7 +268,7 @@ hmm_device_entry_to_pfn(const struct hmm_range *range, uint64_t pfn)
|
|||
* hmm_device_entry_from_page() - create a valid device entry for a page
|
||||
* @range: range use to encode HMM pfn value
|
||||
* @page: page for which to create the device entry
|
||||
* Returns: valid device entry for the page
|
||||
* Return: valid device entry for the page
|
||||
*/
|
||||
static inline uint64_t hmm_device_entry_from_page(const struct hmm_range *range,
|
||||
struct page *page)
|
||||
|
@@ -289,7 +281,7 @@ static inline uint64_t hmm_device_entry_from_page(const struct hmm_range *range,
|
|||
* hmm_device_entry_from_pfn() - create a valid device entry value from pfn
|
||||
* @range: range use to encode HMM pfn value
|
||||
* @pfn: pfn value for which to create the device entry
|
||||
* Returns: valid device entry for the pfn
|
||||
* Return: valid device entry for the pfn
|
||||
*/
|
||||
static inline uint64_t hmm_device_entry_from_pfn(const struct hmm_range *range,
|
||||
unsigned long pfn)
|
||||
|
@@ -332,9 +324,6 @@ static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
|
|||
return hmm_device_entry_from_pfn(range, pfn);
|
||||
}
|
||||
|
||||
|
||||
|
||||
#if IS_ENABLED(CONFIG_HMM_MIRROR)
|
||||
/*
|
||||
* Mirroring: how to synchronize device page table with CPU page table.
|
||||
*
|
||||
|
@@ -394,7 +383,7 @@ enum hmm_update_event {
|
|||
};
|
||||
|
||||
/*
|
||||
* struct hmm_update - HMM update informations for callback
|
||||
* struct hmm_update - HMM update information for callback
|
||||
*
|
||||
* @start: virtual start address of the range to update
|
||||
* @end: virtual end address of the range to update
|
||||
|
@@ -418,17 +407,18 @@ struct hmm_mirror_ops {
|
|||
*
|
||||
* @mirror: pointer to struct hmm_mirror
|
||||
*
|
||||
* This is called when the mm_struct is being released.
|
||||
* The callback should make sure no references to the mirror occur
|
||||
* after the callback returns.
|
||||
* This is called when the mm_struct is being released. The callback
|
||||
* must ensure that all access to any pages obtained from this mirror
|
||||
* is halted before the callback returns. All future access should
|
||||
* fault.
|
||||
*/
|
||||
void (*release)(struct hmm_mirror *mirror);
|
||||
|
||||
/* sync_cpu_device_pagetables() - synchronize page tables
|
||||
*
|
||||
* @mirror: pointer to struct hmm_mirror
|
||||
* @update: update informations (see struct hmm_update)
|
||||
* Returns: -EAGAIN if update.blockable false and callback need to
|
||||
* @update: update information (see struct hmm_update)
|
||||
* Return: -EAGAIN if update.blockable false and callback need to
|
||||
* block, 0 otherwise.
|
||||
*
|
||||
* This callback ultimately originates from mmu_notifiers when the CPU
|
||||
|
@@ -464,36 +454,11 @@ struct hmm_mirror {
|
|||
int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
|
||||
void hmm_mirror_unregister(struct hmm_mirror *mirror);
|
||||
|
||||
/*
|
||||
* hmm_mirror_mm_is_alive() - test if mm is still alive
|
||||
* @mirror: the HMM mm mirror for which we want to lock the mmap_sem
|
||||
* Returns: false if the mm is dead, true otherwise
|
||||
*
|
||||
* This is an optimization it will not accurately always return -EINVAL if the
|
||||
* mm is dead ie there can be false negative (process is being kill but HMM is
|
||||
* not yet inform of that). It is only intented to be use to optimize out case
|
||||
* where driver is about to do something time consuming and it would be better
|
||||
* to skip it if the mm is dead.
|
||||
*/
|
||||
static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
|
||||
{
|
||||
struct mm_struct *mm;
|
||||
|
||||
if (!mirror || !mirror->hmm)
|
||||
return false;
|
||||
mm = READ_ONCE(mirror->hmm->mm);
|
||||
if (mirror->hmm->dead || !mm)
|
||||
return false;
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* Please see Documentation/vm/hmm.rst for how to use the range API.
|
||||
*/
|
||||
int hmm_range_register(struct hmm_range *range,
|
||||
struct mm_struct *mm,
|
||||
struct hmm_mirror *mirror,
|
||||
unsigned long start,
|
||||
unsigned long end,
|
||||
unsigned page_shift);
|
||||
|
@@ -529,7 +494,8 @@ static inline bool hmm_vma_range_done(struct hmm_range *range)
 }
 
 /* This is a temporary helper to avoid merge conflict between trees. */
-static inline int hmm_vma_fault(struct hmm_range *range, bool block)
+static inline int hmm_vma_fault(struct hmm_mirror *mirror,
+				struct hmm_range *range, bool block)
 {
 	long ret;
 
@@ -542,7 +508,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	range->default_flags = 0;
 	range->pfn_flags_mask = -1UL;
 
-	ret = hmm_range_register(range, range->vma->vm_mm,
+	ret = hmm_range_register(range, mirror,
 				 range->start, range->end,
 				 PAGE_SHIFT);
 	if (ret)
@@ -561,7 +527,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	ret = hmm_range_fault(range, block);
 	if (ret <= 0) {
 		if (ret == -EBUSY || !ret) {
-			/* Same as above drop mmap_sem to match old API. */
+			/* Same as above, drop mmap_sem to match old API. */
 			up_read(&range->vma->vm_mm->mmap_sem);
 			ret = -EBUSY;
 		} else if (ret == -EAGAIN)
|
@@ -573,208 +539,12 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
|
|||
}
|
||||
|
||||
/* Below are for HMM internal use only! Not to be used by device driver! */
|
||||
void hmm_mm_destroy(struct mm_struct *mm);
|
||||
|
||||
static inline void hmm_mm_init(struct mm_struct *mm)
|
||||
{
|
||||
mm->hmm = NULL;
|
||||
}
|
||||
#else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
|
||||
static inline void hmm_mm_destroy(struct mm_struct *mm) {}
|
||||
static inline void hmm_mm_init(struct mm_struct *mm) {}
|
||||
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
|
||||
|
||||
#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
|
||||
struct hmm_devmem;
|
||||
|
||||
struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
|
||||
unsigned long addr);
|
||||
|
||||
/*
|
||||
* struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
|
||||
*
|
||||
* @free: call when refcount on page reach 1 and thus is no longer use
|
||||
* @fault: call when there is a page fault to unaddressable memory
|
||||
*
|
||||
* Both callback happens from page_free() and page_fault() callback of struct
|
||||
* dev_pagemap respectively. See include/linux/memremap.h for more details on
|
||||
* those.
|
||||
*
|
||||
* The hmm_devmem_ops callback are just here to provide a coherent and
|
||||
* uniq API to device driver and device driver should not register their
|
||||
* own page_free() or page_fault() but rely on the hmm_devmem_ops call-
|
||||
* back.
|
||||
*/
|
||||
struct hmm_devmem_ops {
|
||||
/*
|
||||
* free() - free a device page
|
||||
* @devmem: device memory structure (see struct hmm_devmem)
|
||||
* @page: pointer to struct page being freed
|
||||
*
|
||||
* Call back occurs whenever a device page refcount reach 1 which
|
||||
* means that no one is holding any reference on the page anymore
|
||||
* (ZONE_DEVICE page have an elevated refcount of 1 as default so
|
||||
* that they are not release to the general page allocator).
|
||||
*
|
||||
* Note that callback has exclusive ownership of the page (as no
|
||||
* one is holding any reference).
|
||||
*/
|
||||
void (*free)(struct hmm_devmem *devmem, struct page *page);
|
||||
/*
|
||||
* fault() - CPU page fault or get user page (GUP)
|
||||
* @devmem: device memory structure (see struct hmm_devmem)
|
||||
* @vma: virtual memory area containing the virtual address
|
||||
* @addr: virtual address that faulted or for which there is a GUP
|
||||
* @page: pointer to struct page backing virtual address (unreliable)
|
||||
* @flags: FAULT_FLAG_* (see include/linux/mm.h)
|
||||
* @pmdp: page middle directory
|
||||
* Returns: VM_FAULT_MINOR/MAJOR on success or one of VM_FAULT_ERROR
|
||||
* on error
|
||||
*
|
||||
* The callback occurs whenever there is a CPU page fault or GUP on a
|
||||
* virtual address. This means that the device driver must migrate the
|
||||
* page back to regular memory (CPU accessible).
|
||||
*
|
||||
* The device driver is free to migrate more than one page from the
|
||||
* fault() callback as an optimization. However if device decide to
|
||||
* migrate more than one page it must always priotirize the faulting
|
||||
* address over the others.
|
||||
*
|
||||
* The struct page pointer is only given as an hint to allow quick
|
||||
* lookup of internal device driver data. A concurrent migration
|
||||
* might have already free that page and the virtual address might
|
||||
* not longer be back by it. So it should not be modified by the
|
||||
* callback.
|
||||
*
|
||||
* Note that mmap semaphore is held in read mode at least when this
|
||||
* callback occurs, hence the vma is valid upon callback entry.
|
||||
*/
|
||||
vm_fault_t (*fault)(struct hmm_devmem *devmem,
|
||||
struct vm_area_struct *vma,
|
||||
unsigned long addr,
|
||||
const struct page *page,
|
||||
unsigned int flags,
|
||||
pmd_t *pmdp);
|
||||
};
|
||||
|
||||
/*
|
||||
* struct hmm_devmem - track device memory
|
||||
*
|
||||
* @completion: completion object for device memory
|
||||
* @pfn_first: first pfn for this resource (set by hmm_devmem_add())
|
||||
* @pfn_last: last pfn for this resource (set by hmm_devmem_add())
|
||||
* @resource: IO resource reserved for this chunk of memory
|
||||
* @pagemap: device page map for that chunk
|
||||
* @device: device to bind resource to
|
||||
* @ops: memory operations callback
|
||||
* @ref: per CPU refcount
|
||||
* @page_fault: callback when CPU fault on an unaddressable device page
|
||||
*
|
||||
* This an helper structure for device drivers that do not wish to implement
|
||||
* the gory details related to hotplugging new memoy and allocating struct
|
||||
* pages.
|
||||
*
|
||||
* Device drivers can directly use ZONE_DEVICE memory on their own if they
|
||||
* wish to do so.
|
||||
*
|
||||
* The page_fault() callback must migrate page back, from device memory to
|
||||
* system memory, so that the CPU can access it. This might fail for various
|
||||
* reasons (device issues, device have been unplugged, ...). When such error
|
||||
* conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
|
||||
* set the CPU page table entry to "poisoned".
|
||||
*
|
||||
* Note that because memory cgroup charges are transferred to the device memory,
|
||||
* this should never fail due to memory restrictions. However, allocation
|
||||
* of a regular system page might still fail because we are out of memory. If
|
||||
* that happens, the page_fault() callback must return VM_FAULT_OOM.
|
||||
*
|
||||
* The page_fault() callback can also try to migrate back multiple pages in one
|
||||
* chunk, as an optimization. It must, however, prioritize the faulting address
|
||||
* over all the others.
|
||||
*/
|
||||
typedef vm_fault_t (*dev_page_fault_t)(struct vm_area_struct *vma,
|
||||
unsigned long addr,
|
||||
const struct page *page,
|
||||
unsigned int flags,
|
||||
pmd_t *pmdp);
|
||||
|
||||
struct hmm_devmem {
|
||||
struct completion completion;
|
||||
unsigned long pfn_first;
|
||||
unsigned long pfn_last;
|
||||
struct resource *resource;
|
||||
struct device *device;
|
||||
struct dev_pagemap pagemap;
|
||||
const struct hmm_devmem_ops *ops;
|
||||
struct percpu_ref ref;
|
||||
dev_page_fault_t page_fault;
|
||||
};
|
||||
|
||||
/*
|
||||
* To add (hotplug) device memory, HMM assumes that there is no real resource
|
||||
* that reserves a range in the physical address space (this is intended to be
|
||||
* use by unaddressable device memory). It will reserve a physical range big
|
||||
* enough and allocate struct page for it.
|
||||
*
|
||||
* The device driver can wrap the hmm_devmem struct inside a private device
|
||||
* driver struct.
|
||||
*/
|
||||
struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
|
||||
struct device *device,
|
||||
unsigned long size);
|
||||
struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
|
||||
struct device *device,
|
||||
struct resource *res);
|
||||
|
||||
/*
|
||||
* hmm_devmem_page_set_drvdata - set per-page driver data field
|
||||
*
|
||||
* @page: pointer to struct page
|
||||
* @data: driver data value to set
|
||||
*
|
||||
* Because the page cannot be on the LRU, we have an unsigned long that the driver can use
|
||||
* to store a per page field. This is just a simple helper to do that.
|
||||
*/
|
||||
static inline void hmm_devmem_page_set_drvdata(struct page *page,
|
||||
unsigned long data)
|
||||
{
|
||||
page->hmm_data = data;
|
||||
}
|
||||
|
||||
/*
|
||||
* hmm_devmem_page_get_drvdata - get per page driver data field
|
||||
*
|
||||
* @page: pointer to struct page
|
||||
* Return: driver data value
|
||||
*/
|
||||
static inline unsigned long hmm_devmem_page_get_drvdata(const struct page *page)
|
||||
{
|
||||
return page->hmm_data;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* struct hmm_device - fake device to hang device memory onto
|
||||
*
|
||||
* @device: device struct
|
||||
* @minor: device minor number
|
||||
*/
|
||||
struct hmm_device {
|
||||
struct device device;
|
||||
unsigned int minor;
|
||||
};
|
||||
|
||||
/*
|
||||
* A device driver that wants to handle multiple devices' memory through a
|
||||
* single fake device can use hmm_device to do so. This is purely a helper and
|
||||
* it is not strictly needed, in order to make use of any HMM functionality.
|
||||
*/
|
||||
struct hmm_device *hmm_device_new(void *drvdata);
|
||||
void hmm_device_put(struct hmm_device *hmm_device);
|
||||
#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
|
||||
#else /* IS_ENABLED(CONFIG_HMM) */
|
||||
static inline void hmm_mm_destroy(struct mm_struct *mm) {}
|
||||
static inline void hmm_mm_init(struct mm_struct *mm) {}
|
||||
#endif /* IS_ENABLED(CONFIG_HMM) */
|
||||
|
||||
#endif /* LINUX_HMM_H */
|
||||
|
|
|
@ -133,8 +133,7 @@ enum {
|
|||
IORES_DESC_PERSISTENT_MEMORY = 4,
|
||||
IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
|
||||
IORES_DESC_DEVICE_PRIVATE_MEMORY = 6,
|
||||
IORES_DESC_DEVICE_PUBLIC_MEMORY = 7,
|
||||
IORES_DESC_RESERVED = 8,
|
||||
IORES_DESC_RESERVED = 7,
|
||||
};
|
||||
|
||||
/*
|
||||
|
@ -296,6 +295,8 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
|
|||
return (r1->start <= r2->end && r1->end >= r2->start);
|
||||
}
|
||||
|
||||
struct resource *devm_request_free_mem_region(struct device *dev,
|
||||
struct resource *base, unsigned long size);
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
#endif /* _LINUX_IOPORT_H */
|
||||
|
|
|
@ -37,13 +37,6 @@ struct vmem_altmap {
|
|||
* A more complete discussion of unaddressable memory may be found in
|
||||
* include/linux/hmm.h and Documentation/vm/hmm.rst.
|
||||
*
|
||||
* MEMORY_DEVICE_PUBLIC:
|
||||
* Device memory that is cache coherent from device and CPU point of view. This
|
||||
* is used on platforms that have an advanced system bus (like CAPI or CCIX). A
|
||||
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
|
||||
* type. Any page of a process can be migrated to such memory. However no one
|
||||
* should be allowed to pin such memory so that it can always be evicted.
|
||||
*
|
||||
* MEMORY_DEVICE_FS_DAX:
|
||||
* Host memory that has similar access semantics as System RAM i.e. DMA
|
||||
* coherent and supports page pinning. In support of coordinating page
|
||||
|
@ -52,54 +45,84 @@ struct vmem_altmap {
|
|||
* wakeup is used to coordinate physical address space management (ex:
|
||||
* fs truncate/hole punch) vs pinned pages (ex: device dma).
|
||||
*
|
||||
* MEMORY_DEVICE_DEVDAX:
|
||||
* Host memory that has similar access semantics as System RAM i.e. DMA
|
||||
* coherent and supports page pinning. In contrast to
|
||||
* MEMORY_DEVICE_FS_DAX, this memory is accessed via a device-dax
|
||||
* character device.
|
||||
*
|
||||
* MEMORY_DEVICE_PCI_P2PDMA:
|
||||
* Device memory residing in a PCI BAR intended for use with Peer-to-Peer
|
||||
* transactions.
|
||||
*/
|
||||
enum memory_type {
|
||||
/* 0 is reserved to catch uninitialized type fields */
|
||||
MEMORY_DEVICE_PRIVATE = 1,
|
||||
MEMORY_DEVICE_PUBLIC,
|
||||
MEMORY_DEVICE_FS_DAX,
|
||||
MEMORY_DEVICE_DEVDAX,
|
||||
MEMORY_DEVICE_PCI_P2PDMA,
|
||||
};
|
||||
|
||||
/*
|
||||
* Additional notes about MEMORY_DEVICE_PRIVATE may be found in
|
||||
* include/linux/hmm.h and Documentation/vm/hmm.rst. There is also a brief
|
||||
* explanation in include/linux/memory_hotplug.h.
|
||||
*
|
||||
* The page_free() callback is called once the page refcount reaches 1
|
||||
* (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
|
||||
* This allows the device driver to implement its own memory management.)
|
||||
*/
|
||||
typedef void (*dev_page_free_t)(struct page *page, void *data);
|
||||
struct dev_pagemap_ops {
	/*
	 * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
	 * reach 0 refcount unless there is a refcount bug. This allows the
	 * device driver to implement its own memory management.)
	 */
	void (*page_free)(struct page *page);

	/*
	 * Transition the refcount in struct dev_pagemap to the dead state.
	 */
	void (*kill)(struct dev_pagemap *pgmap);

	/*
	 * Wait for refcount in struct dev_pagemap to be idle and reap it.
	 */
	void (*cleanup)(struct dev_pagemap *pgmap);

	/*
	 * Used for private (un-addressable) device memory only. Must migrate
	 * the page back to a CPU accessible page.
	 */
	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
};
|
||||
|
||||
#define PGMAP_ALTMAP_VALID (1 << 0)
|
||||
|
||||
/**
|
||||
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
|
||||
* @page_free: free page callback when page refcount reaches 1
|
||||
* @altmap: pre-allocated/reserved memory for vmemmap allocations
|
||||
* @res: physical address range covered by @ref
|
||||
* @ref: reference count that pins the devm_memremap_pages() mapping
|
||||
* @kill: callback to transition @ref to the dead state
|
||||
* @cleanup: callback to wait for @ref to be idle and reap it
|
||||
* @internal_ref: internal reference if @ref is not provided by the caller
|
||||
* @done: completion for @internal_ref
|
||||
* @dev: host device of the mapping for debug
|
||||
* @data: private data pointer for page_free()
|
||||
* @type: memory type: see MEMORY_* in memory_hotplug.h
|
||||
* @flags: PGMAP_* flags to specify detailed behavior
|
||||
* @ops: method table
|
||||
*/
|
||||
struct dev_pagemap {
|
||||
dev_page_free_t page_free;
|
||||
struct vmem_altmap altmap;
|
||||
bool altmap_valid;
|
||||
struct resource res;
|
||||
struct percpu_ref *ref;
|
||||
void (*kill)(struct percpu_ref *ref);
|
||||
void (*cleanup)(struct percpu_ref *ref);
|
||||
struct percpu_ref internal_ref;
|
||||
struct completion done;
|
||||
struct device *dev;
|
||||
void *data;
|
||||
enum memory_type type;
|
||||
unsigned int flags;
|
||||
u64 pci_p2pdma_bus_offset;
|
||||
const struct dev_pagemap_ops *ops;
|
||||
};
|
||||
|
||||
static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
|
||||
{
|
||||
if (pgmap->flags & PGMAP_ALTMAP_VALID)
|
||||
return &pgmap->altmap;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ZONE_DEVICE
|
||||
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
|
||||
void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
|
||||
|
|
|
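The pgmap_altmap() helper above replaces the old altmap_valid boolean with a PGMAP_ALTMAP_VALID flag test, and vmem_altmap_offset() is changed elsewhere in this series to tolerate a NULL altmap. A minimal user-space sketch of how the two combine in pfn_first() is shown here. All sketch_* names and the 4 KiB page shift are assumptions for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT   12          /* assumed 4 KiB pages */
#define SKETCH_ALTMAP_VALID (1u << 0)   /* models PGMAP_ALTMAP_VALID */

/* Hypothetical, trimmed-down stand-ins for vmem_altmap and dev_pagemap. */
struct sketch_altmap {
	unsigned long reserve;
	unsigned long free;
};

struct sketch_pgmap {
	unsigned long long res_start;   /* models pgmap->res.start */
	unsigned int flags;
	struct sketch_altmap altmap;
};

/* Return the altmap only when the flag marks it as initialized. */
static struct sketch_altmap *sketch_pgmap_altmap(struct sketch_pgmap *pgmap)
{
	if (pgmap->flags & SKETCH_ALTMAP_VALID)
		return &pgmap->altmap;
	return NULL;
}

/* NULL-tolerant offset: no altmap means no pfns are skipped. */
static unsigned long sketch_altmap_offset(const struct sketch_altmap *altmap)
{
	if (altmap)
		return altmap->reserve + altmap->free;
	return 0;
}

/* First usable pfn: start of the resource plus the altmap offset, if any. */
static unsigned long sketch_pfn_first(struct sketch_pgmap *pgmap)
{
	return (unsigned long)(pgmap->res_start >> SKETCH_PAGE_SHIFT) +
	       sketch_altmap_offset(sketch_pgmap_altmap(pgmap));
}
```

Because the offset helper accepts NULL, callers no longer need the conditional that the removed lines of pfn_first() carried.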
@ -937,8 +937,6 @@ static inline bool is_zone_device_page(const struct page *page)
|
|||
#endif
|
||||
|
||||
#ifdef CONFIG_DEV_PAGEMAP_OPS
|
||||
void dev_pagemap_get_ops(void);
|
||||
void dev_pagemap_put_ops(void);
|
||||
void __put_devmap_managed_page(struct page *page);
|
||||
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
|
||||
static inline bool put_devmap_managed_page(struct page *page)
|
||||
|
@ -949,7 +947,6 @@ static inline bool put_devmap_managed_page(struct page *page)
|
|||
return false;
|
||||
switch (page->pgmap->type) {
|
||||
case MEMORY_DEVICE_PRIVATE:
|
||||
case MEMORY_DEVICE_PUBLIC:
|
||||
case MEMORY_DEVICE_FS_DAX:
|
||||
__put_devmap_managed_page(page);
|
||||
return true;
|
||||
|
@ -965,12 +962,6 @@ static inline bool is_device_private_page(const struct page *page)
|
|||
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
|
||||
}
|
||||
|
||||
static inline bool is_device_public_page(const struct page *page)
|
||||
{
|
||||
return is_zone_device_page(page) &&
|
||||
page->pgmap->type == MEMORY_DEVICE_PUBLIC;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_PCI_P2PDMA
|
||||
static inline bool is_pci_p2pdma_page(const struct page *page)
|
||||
{
|
||||
|
@ -985,14 +976,6 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
|
|||
#endif /* CONFIG_PCI_P2PDMA */
|
||||
|
||||
#else /* CONFIG_DEV_PAGEMAP_OPS */
|
||||
static inline void dev_pagemap_get_ops(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void dev_pagemap_put_ops(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline bool put_devmap_managed_page(struct page *page)
|
||||
{
|
||||
return false;
|
||||
|
@ -1003,11 +986,6 @@ static inline bool is_device_private_page(const struct page *page)
|
|||
return false;
|
||||
}
|
||||
|
||||
static inline bool is_device_public_page(const struct page *page)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
static inline bool is_pci_p2pdma_page(const struct page *page)
|
||||
{
|
||||
return false;
|
||||
|
@ -1436,10 +1414,8 @@ struct zap_details {
|
|||
pgoff_t last_index; /* Highest page->index to unmap */
|
||||
};
|
||||
|
||||
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
|
||||
pte_t pte, bool with_public_device);
|
||||
#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
|
||||
|
||||
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
|
||||
pte_t pte);
|
||||
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
|
||||
pmd_t pmd);
|
||||
|
||||
|
|
|
@ -158,7 +158,7 @@ struct page {
|
|||
struct { /* ZONE_DEVICE pages */
|
||||
/** @pgmap: Points to the hosting device page map. */
|
||||
struct dev_pagemap *pgmap;
|
||||
unsigned long hmm_data;
|
||||
void *zone_device_data;
|
||||
unsigned long _zd_pad_1; /* uses mapping */
|
||||
};
|
||||
|
||||
|
@ -503,7 +503,7 @@ struct mm_struct {
|
|||
#endif
|
||||
struct work_struct async_put_work;
|
||||
|
||||
#if IS_ENABLED(CONFIG_HMM)
|
||||
#ifdef CONFIG_HMM_MIRROR
|
||||
/* HMM needs to track a few things per mm */
|
||||
struct hmm *hmm;
|
||||
#endif
|
||||
|
|
|
@ -129,12 +129,6 @@ static inline struct page *device_private_entry_to_page(swp_entry_t entry)
|
|||
{
|
||||
return pfn_to_page(swp_offset(entry));
|
||||
}
|
||||
|
||||
vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
|
||||
unsigned long addr,
|
||||
swp_entry_t entry,
|
||||
unsigned int flags,
|
||||
pmd_t *pmdp);
|
||||
#else /* CONFIG_DEVICE_PRIVATE */
|
||||
static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
|
||||
{
|
||||
|
@ -164,15 +158,6 @@ static inline struct page *device_private_entry_to_page(swp_entry_t entry)
|
|||
{
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static inline vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
|
||||
unsigned long addr,
|
||||
swp_entry_t entry,
|
||||
unsigned int flags,
|
||||
pmd_t *pmdp)
|
||||
{
|
||||
return VM_FAULT_SIGBUS;
|
||||
}
|
||||
#endif /* CONFIG_DEVICE_PRIVATE */
|
||||
|
||||
#ifdef CONFIG_MIGRATION
|
||||
|
|
|
@ -677,7 +677,6 @@ void __mmdrop(struct mm_struct *mm)
|
|||
WARN_ON_ONCE(mm == current->active_mm);
|
||||
mm_free_pgd(mm);
|
||||
destroy_context(mm);
|
||||
hmm_mm_destroy(mm);
|
||||
mmu_notifier_mm_destroy(mm);
|
||||
check_mm(mm);
|
||||
put_user_ns(mm->user_ns);
|
||||
|
|
|
@ -11,41 +11,39 @@
|
|||
#include <linux/types.h>
|
||||
#include <linux/wait_bit.h>
|
||||
#include <linux/xarray.h>
|
||||
#include <linux/hmm.h>
|
||||
|
||||
static DEFINE_XARRAY(pgmap_array);
|
||||
#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
|
||||
#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
|
||||
|
||||
#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
|
||||
vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
|
||||
unsigned long addr,
|
||||
swp_entry_t entry,
|
||||
unsigned int flags,
|
||||
pmd_t *pmdp)
|
||||
#ifdef CONFIG_DEV_PAGEMAP_OPS
|
||||
DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
|
||||
EXPORT_SYMBOL(devmap_managed_key);
|
||||
static atomic_t devmap_managed_enable;
|
||||
|
||||
static void devmap_managed_enable_put(void *data)
|
||||
{
|
||||
struct page *page = device_private_entry_to_page(entry);
|
||||
struct hmm_devmem *devmem;
|
||||
|
||||
devmem = container_of(page->pgmap, typeof(*devmem), pagemap);
|
||||
|
||||
/*
|
||||
* The page_fault() callback must migrate page back to system memory
|
||||
* so that CPU can access it. This might fail for various reasons
|
||||
* (device issue, device was unsafely unplugged, ...). When such
|
||||
* error conditions happen, the callback must return VM_FAULT_SIGBUS.
|
||||
*
|
||||
* Note that because memory cgroup charges are accounted to the device
|
||||
* memory, this should never fail because of memory restrictions (but
|
||||
* allocation of regular system page might still fail because we are
|
||||
* out of memory).
|
||||
*
|
||||
* There is a more in-depth description of what that callback can and
|
||||
* cannot do, in include/linux/memremap.h
|
||||
*/
|
||||
return devmem->page_fault(vma, addr, page, flags, pmdp);
|
||||
if (atomic_dec_and_test(&devmap_managed_enable))
|
||||
static_branch_disable(&devmap_managed_key);
|
||||
}
|
||||
#endif /* CONFIG_DEVICE_PRIVATE */
|
||||
|
||||
static int devmap_managed_enable_get(struct device *dev, struct dev_pagemap *pgmap)
|
||||
{
|
||||
if (!pgmap->ops || !pgmap->ops->page_free) {
|
||||
WARN(1, "Missing page_free method\n");
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
if (atomic_inc_return(&devmap_managed_enable) == 1)
|
||||
static_branch_enable(&devmap_managed_key);
|
||||
return devm_add_action_or_reset(dev, devmap_managed_enable_put, NULL);
|
||||
}
|
||||
#else
|
||||
static int devmap_managed_enable_get(struct device *dev, struct dev_pagemap *pgmap)
|
||||
{
|
||||
return -EINVAL;
|
||||
}
|
||||
#endif /* CONFIG_DEV_PAGEMAP_OPS */
|
||||
|
||||
static void pgmap_array_delete(struct resource *res)
|
||||
{
|
||||
|
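devmap_managed_enable_get() and devmap_managed_enable_put() above keep the devmap_managed_key static branch enabled while at least one managed pagemap exists. The counting pattern can be sketched in user space with C11 atomics, as below. This is a simplified model, assuming a plain atomic flag in place of the static key. The sketch_* names are hypothetical, and the sketch does not serialize the counter update with the flag update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int sketch_enable_count;   /* models devmap_managed_enable */
static atomic_bool sketch_key_enabled;   /* models devmap_managed_key */

/* First user switches the key on. */
static void sketch_enable_get(void)
{
	/* atomic_fetch_add returns the previous value, so 0 means "first". */
	if (atomic_fetch_add(&sketch_enable_count, 1) == 0)
		atomic_store(&sketch_key_enabled, true);
}

/* Last user switches the key off again. */
static void sketch_enable_put(void)
{
	int prev = atomic_fetch_sub(&sketch_enable_count, 1);

	assert(prev > 0);                /* an unbalanced put is a bug */
	if (prev == 1)
		atomic_store(&sketch_key_enabled, false);
}
```

In the diff the put side is registered through devm_add_action_or_reset(), so the driver never calls it directly; it runs when the device's managed resources are released.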
@ -56,14 +54,8 @@ static void pgmap_array_delete(struct resource *res)
|
|||
|
||||
static unsigned long pfn_first(struct dev_pagemap *pgmap)
|
||||
{
|
||||
const struct resource *res = &pgmap->res;
|
||||
struct vmem_altmap *altmap = &pgmap->altmap;
|
||||
unsigned long pfn;
|
||||
|
||||
pfn = res->start >> PAGE_SHIFT;
|
||||
if (pgmap->altmap_valid)
|
||||
pfn += vmem_altmap_offset(altmap);
|
||||
return pfn;
|
||||
return (pgmap->res.start >> PAGE_SHIFT) +
|
||||
vmem_altmap_offset(pgmap_altmap(pgmap));
|
||||
}
|
||||
|
||||
static unsigned long pfn_end(struct dev_pagemap *pgmap)
|
||||
|
@ -83,6 +75,24 @@ static unsigned long pfn_next(unsigned long pfn)
|
|||
#define for_each_device_pfn(pfn, map) \
|
||||
for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn))
|
||||
|
||||
static void dev_pagemap_kill(struct dev_pagemap *pgmap)
|
||||
{
|
||||
if (pgmap->ops && pgmap->ops->kill)
|
||||
pgmap->ops->kill(pgmap);
|
||||
else
|
||||
percpu_ref_kill(pgmap->ref);
|
||||
}
|
||||
|
||||
static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
|
||||
{
|
||||
if (pgmap->ops && pgmap->ops->cleanup) {
|
||||
pgmap->ops->cleanup(pgmap);
|
||||
} else {
|
||||
wait_for_completion(&pgmap->done);
|
||||
percpu_ref_exit(pgmap->ref);
|
||||
}
|
||||
}
|
||||
|
||||
static void devm_memremap_pages_release(void *data)
|
||||
{
|
||||
struct dev_pagemap *pgmap = data;
|
||||
|
@ -92,10 +102,10 @@ static void devm_memremap_pages_release(void *data)
|
|||
unsigned long pfn;
|
||||
int nid;
|
||||
|
||||
pgmap->kill(pgmap->ref);
|
||||
dev_pagemap_kill(pgmap);
|
||||
for_each_device_pfn(pfn, pgmap)
|
||||
put_page(pfn_to_page(pfn));
|
||||
pgmap->cleanup(pgmap->ref);
|
||||
dev_pagemap_cleanup(pgmap);
|
||||
|
||||
/* pages are dead and unused, undo the arch mapping */
|
||||
align_start = res->start & ~(SECTION_SIZE - 1);
|
||||
|
@ -111,7 +121,7 @@ static void devm_memremap_pages_release(void *data)
|
|||
align_size >> PAGE_SHIFT, NULL);
|
||||
} else {
|
||||
arch_remove_memory(nid, align_start, align_size,
|
||||
pgmap->altmap_valid ? &pgmap->altmap : NULL);
|
||||
pgmap_altmap(pgmap));
|
||||
kasan_remove_zero_shadow(__va(align_start), align_size);
|
||||
}
|
||||
mem_hotplug_done();
|
||||
|
@ -122,20 +132,29 @@ static void devm_memremap_pages_release(void *data)
|
|||
"%s: failed to free all reserved pages\n", __func__);
|
||||
}
|
||||
|
||||
static void dev_pagemap_percpu_release(struct percpu_ref *ref)
|
||||
{
|
||||
struct dev_pagemap *pgmap =
|
||||
container_of(ref, struct dev_pagemap, internal_ref);
|
||||
|
||||
complete(&pgmap->done);
|
||||
}
|
||||
|
||||
/**
|
||||
* devm_memremap_pages - remap and provide memmap backing for the given resource
|
||||
* @dev: hosting device for @res
|
||||
* @pgmap: pointer to a struct dev_pagemap
|
||||
*
|
||||
* Notes:
|
||||
* 1/ At a minimum the res, ref and type members of @pgmap must be initialized
|
||||
* 1/ At a minimum the res and type members of @pgmap must be initialized
|
||||
* by the caller before passing it to this function
|
||||
*
|
||||
* 2/ The altmap field may optionally be initialized, in which case altmap_valid
|
||||
* must be set to true
|
||||
* 2/ The altmap field may optionally be initialized, in which case
|
||||
* PGMAP_ALTMAP_VALID must be set in pgmap->flags.
|
||||
*
|
||||
* 3/ pgmap->ref must be 'live' on entry and will be killed and reaped
|
||||
* at devm_memremap_pages_release() time, or if this routine fails.
|
||||
* 3/ The ref field may optionally be provided, in which case pgmap->ref must be
|
||||
* 'live' on entry and will be killed and reaped at
|
||||
* devm_memremap_pages_release() time, or if this routine fails.
|
||||
*
|
||||
* 4/ res is expected to be a host memory range that could feasibly be
|
||||
* treated as a "System RAM" range, i.e. not a device mmio range, but
|
||||
|
@ -144,22 +163,66 @@ static void devm_memremap_pages_release(void *data)
|
|||
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
|
||||
{
|
||||
resource_size_t align_start, align_size, align_end;
|
||||
struct vmem_altmap *altmap = pgmap->altmap_valid ?
|
||||
&pgmap->altmap : NULL;
|
||||
struct resource *res = &pgmap->res;
|
||||
struct dev_pagemap *conflict_pgmap;
|
||||
struct mhp_restrictions restrictions = {
|
||||
/*
|
||||
* We do not want any optional features only our own memmap
|
||||
*/
|
||||
.altmap = altmap,
|
||||
.altmap = pgmap_altmap(pgmap),
|
||||
};
|
||||
pgprot_t pgprot = PAGE_KERNEL;
|
||||
int error, nid, is_ram;
|
||||
bool need_devmap_managed = true;
|
||||
|
||||
if (!pgmap->ref || !pgmap->kill || !pgmap->cleanup) {
|
||||
WARN(1, "Missing reference count teardown definition\n");
|
||||
return ERR_PTR(-EINVAL);
|
||||
switch (pgmap->type) {
|
||||
case MEMORY_DEVICE_PRIVATE:
|
||||
if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
|
||||
WARN(1, "Device private memory not supported\n");
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
|
||||
WARN(1, "Missing migrate_to_ram method\n");
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
break;
|
||||
case MEMORY_DEVICE_FS_DAX:
|
||||
if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
|
||||
IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
|
||||
WARN(1, "File system DAX not supported\n");
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
break;
|
||||
case MEMORY_DEVICE_DEVDAX:
|
||||
case MEMORY_DEVICE_PCI_P2PDMA:
|
||||
need_devmap_managed = false;
|
||||
break;
|
||||
default:
|
||||
WARN(1, "Invalid pgmap type %d\n", pgmap->type);
|
||||
break;
|
||||
}
|
||||
|
||||
if (!pgmap->ref) {
|
||||
if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup))
|
||||
return ERR_PTR(-EINVAL);
|
||||
|
||||
init_completion(&pgmap->done);
|
||||
error = percpu_ref_init(&pgmap->internal_ref,
|
||||
dev_pagemap_percpu_release, 0, GFP_KERNEL);
|
||||
if (error)
|
||||
return ERR_PTR(error);
|
||||
pgmap->ref = &pgmap->internal_ref;
|
||||
} else {
|
||||
if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) {
|
||||
WARN(1, "Missing reference count teardown definition\n");
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
}
|
||||
|
||||
if (need_devmap_managed) {
|
||||
error = devmap_managed_enable_get(dev, pgmap);
|
||||
if (error)
|
||||
return ERR_PTR(error);
|
||||
}
|
||||
|
||||
align_start = res->start & ~(SECTION_SIZE - 1);
|
||||
|
@ -241,7 +304,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
|
|||
|
||||
zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
|
||||
move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
|
||||
align_size >> PAGE_SHIFT, altmap);
|
||||
align_size >> PAGE_SHIFT, pgmap_altmap(pgmap));
|
||||
}
|
||||
|
||||
mem_hotplug_done();
|
||||
|
@ -271,9 +334,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
|
|||
err_pfn_remap:
|
||||
pgmap_array_delete(res);
|
||||
err_array:
|
||||
pgmap->kill(pgmap->ref);
|
||||
pgmap->cleanup(pgmap->ref);
|
||||
|
||||
dev_pagemap_kill(pgmap);
|
||||
dev_pagemap_cleanup(pgmap);
|
||||
return ERR_PTR(error);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(devm_memremap_pages);
|
||||
|
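The reference-count rules that devm_memremap_pages() enforces above are easy to misread in diff form. If the caller supplies pgmap->ref, both ops->kill and ops->cleanup are required. If it does not, neither may be set and the internal percpu_ref is used. A rough user-space sketch of that decision follows. The sketch_* types are hypothetical and use booleans in place of the real pointers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: booleans model "callback pointer is non-NULL". */
struct sketch_ops {
	bool has_kill;
	bool has_cleanup;
};

struct sketch_pgmap {
	bool has_ref;                    /* caller supplied pgmap->ref */
	const struct sketch_ops *ops;    /* may be NULL */
};

/*
 * Returns 0 when the combination is accepted, -EINVAL otherwise.
 * On success *use_internal_ref tells whether the internal ref is used.
 */
static int sketch_check_ref(const struct sketch_pgmap *pgmap,
			    bool *use_internal_ref)
{
	const struct sketch_ops *ops = pgmap->ops;

	if (!pgmap->has_ref) {
		/* Custom teardown hooks make no sense without a caller ref. */
		if (ops && (ops->has_kill || ops->has_cleanup))
			return -EINVAL;
		*use_internal_ref = true;
		return 0;
	}

	/* A caller-owned ref needs both teardown hooks. */
	if (!ops || !ops->has_kill || !ops->has_cleanup)
		return -EINVAL;
	*use_internal_ref = false;
	return 0;
}
```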
@ -287,7 +349,9 @@ EXPORT_SYMBOL_GPL(devm_memunmap_pages);
|
|||
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
|
||||
{
|
||||
/* number of pfns from base where pfn_to_page() is valid */
|
||||
return altmap->reserve + altmap->free;
|
||||
if (altmap)
|
||||
return altmap->reserve + altmap->free;
|
||||
return 0;
|
||||
}
|
||||
|
||||
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
|
||||
|
@ -329,28 +393,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
|
|||
EXPORT_SYMBOL_GPL(get_dev_pagemap);
|
||||
|
||||
#ifdef CONFIG_DEV_PAGEMAP_OPS
|
||||
DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
|
||||
EXPORT_SYMBOL(devmap_managed_key);
|
||||
static atomic_t devmap_enable;
|
||||
|
||||
/*
|
||||
* Toggle the static key for ->page_free() callbacks when dev_pagemap
|
||||
* pages go idle.
|
||||
*/
|
||||
void dev_pagemap_get_ops(void)
|
||||
{
|
||||
if (atomic_inc_return(&devmap_enable) == 1)
|
||||
static_branch_enable(&devmap_managed_key);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
|
||||
|
||||
void dev_pagemap_put_ops(void)
|
||||
{
|
||||
if (atomic_dec_and_test(&devmap_enable))
|
||||
static_branch_disable(&devmap_managed_key);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
|
||||
|
||||
void __put_devmap_managed_page(struct page *page)
|
||||
{
|
||||
int count = page_ref_dec_return(page);
|
||||
|
@ -366,7 +408,7 @@ void __put_devmap_managed_page(struct page *page)
|
|||
|
||||
mem_cgroup_uncharge(page);
|
||||
|
||||
page->pgmap->page_free(page, page->pgmap->data);
|
||||
page->pgmap->ops->page_free(page);
|
||||
} else if (!count)
|
||||
__put_page(page);
|
||||
}
|
||||
|
|
|
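With this change __put_devmap_managed_page() calls page->pgmap->ops->page_free(page) with no separate data argument, and the callback fires when the refcount drops to 1 rather than 0. A small single-threaded model of that convention is sketched below. The sketch_* types, the plain int refcount and the counter are assumptions for illustration only; the kernel uses atomic page reference helpers.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_page;

/* Hypothetical miniature of dev_pagemap_ops with only page_free. */
struct sketch_pagemap_ops {
	void (*page_free)(struct sketch_page *page);
};

struct sketch_pagemap {
	const struct sketch_pagemap_ops *ops;
};

struct sketch_page {
	int refcount;                    /* non-atomic: single-threaded model */
	struct sketch_pagemap *pgmap;
	void *zone_device_data;          /* per-page driver data */
};

static int sketch_freed_count;

/* Example driver callback: the page is idle again, drop driver state. */
static void sketch_driver_page_free(struct sketch_page *page)
{
	page->zone_device_data = NULL;
	sketch_freed_count++;
}

/* Drop one reference; a count of 1 means "free" for managed pages. */
static void sketch_put_managed_page(struct sketch_page *page)
{
	int count = --page->refcount;

	if (count == 1)
		page->pgmap->ops->page_free(page);
	/* count == 0 is not expected here; the kernel falls back to __put_page(). */
}
```

Driver state that used to travel through pgmap->data now has to be reachable from the page itself, for example through zone_device_data or by container_of() on page->pgmap.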
@ -1628,6 +1628,45 @@ void resource_list_free(struct list_head *head)
|
|||
}
|
||||
EXPORT_SYMBOL(resource_list_free);
|
||||
|
||||
#ifdef CONFIG_DEVICE_PRIVATE
/**
 * devm_request_free_mem_region - find free region for device private memory
 *
 * @dev: device struct to bind the resource to
 * @size: size in bytes of the device memory to add
 * @base: resource tree to look in
 *
 * This function tries to find an empty range of physical address big enough to
 * contain the new resource, so that it can later be hotplugged as ZONE_DEVICE
 * memory, which in turn allocates struct pages.
 */
struct resource *devm_request_free_mem_region(struct device *dev,
		struct resource *base, unsigned long size)
{
	resource_size_t end, addr;
	struct resource *res;

	size = ALIGN(size, 1UL << PA_SECTION_SHIFT);
	end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1);
	addr = end - size + 1UL;

	for (; addr > size && addr >= base->start; addr -= size) {
		if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
				REGION_DISJOINT)
			continue;

		res = devm_request_mem_region(dev, addr, size, dev_name(dev));
		if (!res)
			return ERR_PTR(-ENOMEM);
		res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
		return res;
	}

	return ERR_PTR(-ERANGE);
}
EXPORT_SYMBOL_GPL(devm_request_free_mem_region);
#endif /* CONFIG_DEVICE_PRIVATE */
|
||||
|
||||
static int __init strict_iomem(char *str)
|
||||
{
|
||||
if (strstr(str, "relaxed"))
|
||||
|
|
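devm_request_free_mem_region() searches from the top of the resource tree downward, in steps of the section-aligned size, for a window that intersects nothing already claimed. The same walk can be sketched in user space against a plain array of busy ranges. The section shift, the physical address width, the busy_range type and the find_free_region() name are all assumptions for illustration; the real function relies on region_intersects() and devm_request_mem_region().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for the sketch; the kernel's are architecture specific. */
#define SKETCH_SECTION_SHIFT 27   /* 128 MiB sections */
#define SKETCH_MAX_PHYS_BITS 46

struct busy_range {
	uint64_t start;
	uint64_t end;                 /* inclusive */
};

/* True if [addr, addr + size) touches any busy range. */
static bool overlaps_busy(const struct busy_range *busy, int n,
			  uint64_t addr, uint64_t size)
{
	for (int i = 0; i < n; i++)
		if (addr <= busy[i].end && addr + size - 1 >= busy[i].start)
			return true;
	return false;
}

/* Returns the start of a free window, or 0 if none was found. */
static uint64_t find_free_region(uint64_t base_start, uint64_t base_end,
				 const struct busy_range *busy, int n,
				 uint64_t size)
{
	const uint64_t section = 1ULL << SKETCH_SECTION_SHIFT;
	const uint64_t limit = (1ULL << SKETCH_MAX_PHYS_BITS) - 1;
	uint64_t end, addr;

	/* Round the request up to whole sections. */
	size = (size + section - 1) & ~(section - 1);
	end = base_end < limit ? base_end : limit;
	if (end < size - 1)           /* also rejects size == 0 */
		return 0;

	/* Walk downward from the top, one window at a time. */
	for (addr = end - size + 1; addr > size && addr >= base_start;
	     addr -= size) {
		if (!overlaps_busy(busy, n, addr, size))
			return addr;
	}
	return 0;
}
```

Searching from the top keeps the fake physical range for unaddressable device memory away from low addresses, where real RAM and MMIO ranges are more likely to sit.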
50
mm/Kconfig
|
@ -670,47 +670,17 @@ config ZONE_DEVICE
|
|||
|
||||
If FS_DAX is enabled, then say Y.
|
||||
|
||||
config ARCH_HAS_HMM_MIRROR
|
||||
bool
|
||||
default y
|
||||
depends on (X86_64 || PPC64)
|
||||
depends on MMU && 64BIT
|
||||
|
||||
config ARCH_HAS_HMM_DEVICE
|
||||
bool
|
||||
default y
|
||||
depends on (X86_64 || PPC64)
|
||||
depends on MEMORY_HOTPLUG
|
||||
depends on MEMORY_HOTREMOVE
|
||||
depends on SPARSEMEM_VMEMMAP
|
||||
depends on ARCH_HAS_ZONE_DEVICE
|
||||
select XARRAY_MULTI
|
||||
|
||||
config ARCH_HAS_HMM
|
||||
bool
|
||||
default y
|
||||
depends on (X86_64 || PPC64)
|
||||
depends on ZONE_DEVICE
|
||||
depends on MMU && 64BIT
|
||||
depends on MEMORY_HOTPLUG
|
||||
depends on MEMORY_HOTREMOVE
|
||||
depends on SPARSEMEM_VMEMMAP
|
||||
|
||||
config MIGRATE_VMA_HELPER
|
||||
bool
|
||||
|
||||
config DEV_PAGEMAP_OPS
|
||||
bool
|
||||
|
||||
config HMM
|
||||
bool
|
||||
select MMU_NOTIFIER
|
||||
select MIGRATE_VMA_HELPER
|
||||
|
||||
config HMM_MIRROR
|
||||
bool "HMM mirror CPU page table into a device page table"
|
||||
depends on ARCH_HAS_HMM
|
||||
select HMM
|
||||
depends on (X86_64 || PPC64)
|
||||
depends on MMU && 64BIT
|
||||
select MMU_NOTIFIER
|
||||
help
|
||||
Select HMM_MIRROR if you want to mirror range of the CPU page table of a
|
||||
process into a device page table. Here, mirror means "keep synchronized".
|
||||
|
@ -720,8 +690,7 @@ config HMM_MIRROR
|
|||
|
||||
config DEVICE_PRIVATE
|
||||
bool "Unaddressable device memory (GPU memory, ...)"
|
||||
depends on ARCH_HAS_HMM
|
||||
select HMM
|
||||
depends on ZONE_DEVICE
|
||||
select DEV_PAGEMAP_OPS
|
||||
|
||||
help
|
||||
|
@ -729,17 +698,6 @@ config DEVICE_PRIVATE
|
|||
memory; i.e., memory that is only accessible from the device (or
|
||||
group of devices). You likely also want to select HMM_MIRROR.
|
||||
|
||||
config DEVICE_PUBLIC
|
||||
bool "Addressable device memory (like GPU memory)"
|
||||
depends on ARCH_HAS_HMM
|
||||
select HMM
|
||||
select DEV_PAGEMAP_OPS
|
||||
|
||||
help
|
||||
Allows creation of struct pages to represent addressable device
|
||||
memory; i.e., memory that is accessible from both the device and
|
||||
the CPU
|
||||
|
||||
config FRAME_VECTOR
|
||||
bool
|
||||
|
||||
|
|
|
@ -102,5 +102,5 @@ obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
|
|||
obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
|
||||
obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
|
||||
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
|
||||
obj-$(CONFIG_HMM) += hmm.o
|
||||
obj-$(CONFIG_HMM_MIRROR) += hmm.o
|
||||
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
|
||||
|
|
7
mm/gup.c
|
@ -609,13 +609,6 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
|
|||
if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
|
||||
goto unmap;
|
||||
*page = pte_page(*pte);
|
||||
|
||||
/*
|
||||
* This should never happen (a device public page in the gate
|
||||
* area).
|
||||
*/
|
||||
if (is_device_public_page(*page))
|
||||
goto unmap;
|
||||
}
|
||||
if (unlikely(!try_get_page(*page))) {
|
||||
ret = -ENOMEM;
|
||||
|
|
653
mm/hmm.c
|
@ -20,26 +20,14 @@
|
|||
#include <linux/swapops.h>
|
||||
#include <linux/hugetlb.h>
|
||||
#include <linux/memremap.h>
|
||||
#include <linux/sched/mm.h>
|
||||
#include <linux/jump_label.h>
|
||||
#include <linux/dma-mapping.h>
|
||||
#include <linux/mmu_notifier.h>
|
||||
#include <linux/memory_hotplug.h>
|
||||
|
||||
#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
|
||||
|
||||
#if IS_ENABLED(CONFIG_HMM_MIRROR)
|
||||
static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
|
||||
|
||||
static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
|
||||
{
|
||||
struct hmm *hmm = READ_ONCE(mm->hmm);
|
||||
|
||||
if (hmm && kref_get_unless_zero(&hmm->kref))
|
||||
return hmm;
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
/**
|
||||
* hmm_get_or_create - register HMM against an mm (HMM internal)
|
||||
*
|
||||
|
@ -54,11 +42,16 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
|
|||
*/
|
||||
static struct hmm *hmm_get_or_create(struct mm_struct *mm)
|
||||
{
|
||||
struct hmm *hmm = mm_get_hmm(mm);
|
||||
bool cleanup = false;
|
||||
struct hmm *hmm;
|
||||
|
||||
if (hmm)
|
||||
return hmm;
|
||||
lockdep_assert_held_write(&mm->mmap_sem);
|
||||
|
||||
/* Abuse the page_table_lock to also protect mm->hmm. */
|
||||
spin_lock(&mm->page_table_lock);
|
||||
hmm = mm->hmm;
|
||||
if (mm->hmm && kref_get_unless_zero(&mm->hmm->kref))
|
||||
goto out_unlock;
|
||||
spin_unlock(&mm->page_table_lock);
|
||||
|
||||
hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
|
||||
if (!hmm)
|
||||
|
@ -68,55 +61,50 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
|
|||
init_rwsem(&hmm->mirrors_sem);
|
||||
hmm->mmu_notifier.ops = NULL;
|
||||
INIT_LIST_HEAD(&hmm->ranges);
|
||||
mutex_init(&hmm->lock);
|
||||
spin_lock_init(&hmm->ranges_lock);
|
||||
kref_init(&hmm->kref);
|
||||
hmm->notifiers = 0;
|
||||
hmm->dead = false;
|
||||
hmm->mm = mm;
|
||||
|
||||
spin_lock(&mm->page_table_lock);
|
||||
if (!mm->hmm)
|
||||
mm->hmm = hmm;
|
||||
else
|
||||
cleanup = true;
|
||||
spin_unlock(&mm->page_table_lock);
|
||||
hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
|
||||
if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
|
||||
kfree(hmm);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
if (cleanup)
|
||||
goto error;
|
||||
mmgrab(hmm->mm);
|
||||
|
||||
/*
|
||||
* We should only get here if we hold the mmap_sem in write mode, i.e. on
|
||||
* registration of first mirror through hmm_mirror_register()
|
||||
* We hold the exclusive mmap_sem here so we know that mm->hmm is
|
||||
* still NULL or 0 kref, and is safe to update.
|
||||
*/
|
||||
hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
|
||||
if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
|
||||
goto error_mm;
|
||||
|
||||
return hmm;
|
||||
|
||||
error_mm:
|
||||
spin_lock(&mm->page_table_lock);
|
||||
if (mm->hmm == hmm)
|
||||
mm->hmm = NULL;
|
||||
mm->hmm = hmm;
|
||||
|
||||
out_unlock:
|
||||
spin_unlock(&mm->page_table_lock);
|
||||
error:
|
||||
return hmm;
|
||||
}
|
||||
|
||||
static void hmm_free_rcu(struct rcu_head *rcu)
|
||||
{
|
||||
struct hmm *hmm = container_of(rcu, struct hmm, rcu);
|
||||
|
||||
mmdrop(hmm->mm);
|
||||
kfree(hmm);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static void hmm_free(struct kref *kref)
|
||||
{
|
||||
struct hmm *hmm = container_of(kref, struct hmm, kref);
|
||||
struct mm_struct *mm = hmm->mm;
|
||||
|
||||
mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
|
||||
spin_lock(&hmm->mm->page_table_lock);
|
||||
if (hmm->mm->hmm == hmm)
|
||||
hmm->mm->hmm = NULL;
|
||||
spin_unlock(&hmm->mm->page_table_lock);
|
||||
|
||||
spin_lock(&mm->page_table_lock);
|
||||
if (mm->hmm == hmm)
|
||||
mm->hmm = NULL;
|
||||
spin_unlock(&mm->page_table_lock);
|
||||
|
||||
kfree(hmm);
|
||||
mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
|
||||
mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
|
||||
}
|
||||
|
||||
static inline void hmm_put(struct hmm *hmm)
|
||||
|
@ -124,126 +112,40 @@ static inline void hmm_put(struct hmm *hmm)
|
|||
kref_put(&hmm->kref, hmm_free);
|
||||
}
|
||||
|
||||
void hmm_mm_destroy(struct mm_struct *mm)
|
||||
{
|
||||
struct hmm *hmm;
|
||||
|
||||
spin_lock(&mm->page_table_lock);
|
||||
hmm = mm_get_hmm(mm);
|
||||
mm->hmm = NULL;
|
||||
if (hmm) {
|
||||
hmm->mm = NULL;
|
||||
hmm->dead = true;
|
||||
spin_unlock(&mm->page_table_lock);
|
||||
hmm_put(hmm);
|
||||
return;
|
||||
}
|
||||
|
||||
spin_unlock(&mm->page_table_lock);
|
||||
}
|
||||
|
||||
static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
|
||||
{
|
||||
struct hmm *hmm = mm_get_hmm(mm);
|
||||
struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
|
||||
struct hmm_mirror *mirror;
|
||||
struct hmm_range *range;
|
||||
|
||||
/* Report this HMM as dying. */
|
||||
hmm->dead = true;
|
||||
/* Bail out if hmm is in the process of being freed */
|
||||
if (!kref_get_unless_zero(&hmm->kref))
|
||||
return;
|
||||
|
||||
/* Wake-up everyone waiting on any range. */
|
||||
mutex_lock(&hmm->lock);
|
||||
list_for_each_entry(range, &hmm->ranges, list) {
|
||||
range->valid = false;
|
||||
}
|
||||
wake_up_all(&hmm->wq);
|
||||
mutex_unlock(&hmm->lock);
|
||||
/*
|
||||
* Since hmm_range_register() holds the mmget() lock hmm_release() is
|
||||
* prevented as long as a range exists.
|
||||
*/
|
||||
WARN_ON(!list_empty_careful(&hmm->ranges));
|
||||
|
||||
down_write(&hmm->mirrors_sem);
|
||||
mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
|
||||
list);
|
||||
while (mirror) {
|
||||
list_del_init(&mirror->list);
|
||||
if (mirror->ops->release) {
|
||||
/*
|
||||
* Drop mirrors_sem so callback can wait on any pending
|
||||
* work that might itself trigger mmu_notifier callback
|
||||
* and thus would deadlock with us.
|
||||
*/
|
||||
up_write(&hmm->mirrors_sem);
|
||||
mirror->ops->release(mirror);
|
||||
down_write(&hmm->mirrors_sem);
|
||||
}
|
||||
mirror = list_first_entry_or_null(&hmm->mirrors,
|
||||
struct hmm_mirror, list);
|
||||
}
|
||||
up_write(&hmm->mirrors_sem);
|
||||
|
||||
hmm_put(hmm);
|
||||
}
|
||||
|
||||
static int hmm_invalidate_range_start(struct mmu_notifier *mn,
|
||||
const struct mmu_notifier_range *nrange)
|
||||
{
|
||||
struct hmm *hmm = mm_get_hmm(nrange->mm);
|
||||
struct hmm_mirror *mirror;
|
||||
struct hmm_update update;
|
||||
struct hmm_range *range;
|
||||
int ret = 0;
|
||||
|
||||
VM_BUG_ON(!hmm);
|
||||
|
||||
update.start = nrange->start;
|
||||
update.end = nrange->end;
|
||||
update.event = HMM_UPDATE_INVALIDATE;
|
||||
update.blockable = mmu_notifier_range_blockable(nrange);
|
||||
|
||||
if (mmu_notifier_range_blockable(nrange))
|
||||
mutex_lock(&hmm->lock);
|
||||
else if (!mutex_trylock(&hmm->lock)) {
|
||||
ret = -EAGAIN;
|
||||
goto out;
|
||||
}
|
||||
hmm->notifiers++;
|
||||
list_for_each_entry(range, &hmm->ranges, list) {
|
||||
if (update.end < range->start || update.start >= range->end)
|
||||
continue;
|
||||
|
||||
range->valid = false;
|
||||
}
|
||||
mutex_unlock(&hmm->lock);
|
||||
|
||||
if (mmu_notifier_range_blockable(nrange))
|
||||
down_read(&hmm->mirrors_sem);
|
||||
else if (!down_read_trylock(&hmm->mirrors_sem)) {
|
||||
ret = -EAGAIN;
|
||||
goto out;
|
||||
}
|
||||
down_read(&hmm->mirrors_sem);
|
||||
list_for_each_entry(mirror, &hmm->mirrors, list) {
|
||||
int ret;
|
||||
|
||||
ret = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
|
||||
if (!update.blockable && ret == -EAGAIN) {
|
||||
up_read(&hmm->mirrors_sem);
|
||||
ret = -EAGAIN;
|
||||
goto out;
|
||||
}
|
||||
/*
|
||||
* Note: The driver is not allowed to trigger
|
||||
* hmm_mirror_unregister() from this thread.
|
||||
*/
|
||||
if (mirror->ops->release)
|
||||
mirror->ops->release(mirror);
|
||||
}
|
||||
up_read(&hmm->mirrors_sem);
|
||||
|
||||
out:
|
||||
hmm_put(hmm);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void hmm_invalidate_range_end(struct mmu_notifier *mn,
|
||||
const struct mmu_notifier_range *nrange)
|
||||
static void notifiers_decrement(struct hmm *hmm)
|
||||
{
|
||||
struct hmm *hmm = mm_get_hmm(nrange->mm);
|
||||
unsigned long flags;
|
||||
|
||||
VM_BUG_ON(!hmm);
|
||||
|
||||
mutex_lock(&hmm->lock);
|
||||
spin_lock_irqsave(&hmm->ranges_lock, flags);
|
||||
hmm->notifiers--;
|
||||
if (!hmm->notifiers) {
|
||||
struct hmm_range *range;
|
||||
|
@ -255,8 +157,73 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
|
|||
}
|
||||
wake_up_all(&hmm->wq);
|
||||
}
|
||||
mutex_unlock(&hmm->lock);
|
||||
spin_unlock_irqrestore(&hmm->ranges_lock, flags);
|
||||
}
|
||||
|
||||
static int hmm_invalidate_range_start(struct mmu_notifier *mn,
|
||||
const struct mmu_notifier_range *nrange)
|
||||
{
|
||||
struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
|
||||
struct hmm_mirror *mirror;
|
||||
struct hmm_update update;
|
||||
struct hmm_range *range;
|
||||
unsigned long flags;
|
||||
int ret = 0;
|
||||
|
||||
if (!kref_get_unless_zero(&hmm->kref))
|
||||
return 0;
|
||||
|
||||
update.start = nrange->start;
|
||||
update.end = nrange->end;
|
||||
update.event = HMM_UPDATE_INVALIDATE;
|
||||
update.blockable = mmu_notifier_range_blockable(nrange);
|
||||
|
||||
spin_lock_irqsave(&hmm->ranges_lock, flags);
|
||||
hmm->notifiers++;
|
||||
list_for_each_entry(range, &hmm->ranges, list) {
|
||||
if (update.end < range->start || update.start >= range->end)
|
||||
continue;
|
||||
|
||||
range->valid = false;
|
||||
}
|
||||
spin_unlock_irqrestore(&hmm->ranges_lock, flags);
|
||||
|
||||
if (mmu_notifier_range_blockable(nrange))
|
||||
down_read(&hmm->mirrors_sem);
|
||||
else if (!down_read_trylock(&hmm->mirrors_sem)) {
|
||||
ret = -EAGAIN;
|
||||
goto out;
|
||||
}
|
||||
|
||||
list_for_each_entry(mirror, &hmm->mirrors, list) {
|
||||
int rc;
|
||||
|
||||
rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
|
||||
if (rc) {
|
||||
if (WARN_ON(update.blockable || rc != -EAGAIN))
|
||||
continue;
|
||||
ret = -EAGAIN;
|
||||
break;
|
||||
}
|
||||
}
|
||||
up_read(&hmm->mirrors_sem);
|
||||
|
||||
out:
|
||||
if (ret)
|
||||
notifiers_decrement(hmm);
|
||||
hmm_put(hmm);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void hmm_invalidate_range_end(struct mmu_notifier *mn,
|
||||
const struct mmu_notifier_range *nrange)
|
||||
{
|
||||
struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
|
||||
|
||||
if (!kref_get_unless_zero(&hmm->kref))
|
||||
return;
|
||||
|
||||
notifiers_decrement(hmm);
|
||||
hmm_put(hmm);
|
||||
}
|
||||
|
||||
|
@ -271,14 +238,15 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
|
|||
*
|
||||
* @mirror: new mirror struct to register
|
||||
* @mm: mm to register against
|
||||
* Return: 0 on success, -ENOMEM if no memory, -EINVAL if invalid arguments
|
||||
*
|
||||
* To start mirroring a process address space, the device driver must register
|
||||
* an HMM mirror struct.
|
||||
*
|
||||
* THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
|
||||
*/
|
||||
int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
|
||||
{
|
||||
lockdep_assert_held_write(&mm->mmap_sem);
|
||||
|
||||
/* Sanity check */
|
||||
if (!mm || !mirror || !mirror->ops)
|
||||
return -EINVAL;
|
||||
|
@ -298,23 +266,17 @@ EXPORT_SYMBOL(hmm_mirror_register);
|
|||
/*
|
||||
* hmm_mirror_unregister() - unregister a mirror
|
||||
*
|
||||
* @mirror: new mirror struct to register
|
||||
* @mirror: mirror struct to unregister
|
||||
*
|
||||
* Stop mirroring a process address space, and cleanup.
|
||||
*/
|
||||
void hmm_mirror_unregister(struct hmm_mirror *mirror)
|
||||
{
|
||||
struct hmm *hmm = READ_ONCE(mirror->hmm);
|
||||
|
||||
if (hmm == NULL)
|
||||
return;
|
||||
struct hmm *hmm = mirror->hmm;
|
||||
|
||||
down_write(&hmm->mirrors_sem);
|
||||
list_del_init(&mirror->list);
|
||||
/* To protect us against double unregister ... */
|
||||
mirror->hmm = NULL;
|
||||
list_del(&mirror->list);
|
||||
up_write(&hmm->mirrors_sem);
|
||||
|
||||
hmm_put(hmm);
|
||||
}
|
||||
EXPORT_SYMBOL(hmm_mirror_unregister);
|
||||
|
@ -330,7 +292,7 @@ struct hmm_vma_walk {
|
|||
static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
|
||||
bool write_fault, uint64_t *pfn)
|
||||
{
|
||||
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
|
||||
unsigned int flags = FAULT_FLAG_REMOTE;
|
||||
struct hmm_vma_walk *hmm_vma_walk = walk->private;
|
||||
struct hmm_range *range = hmm_vma_walk->range;
|
||||
struct vm_area_struct *vma = walk->vma;
|
||||
|
@ -372,7 +334,7 @@ static int hmm_pfns_bad(unsigned long addr,
|
|||
* @fault: should we fault or not ?
|
||||
* @write_fault: write fault ?
|
||||
* @walk: mm_walk structure
|
||||
* Returns: 0 on success, -EBUSY after page fault, or page fault error
|
||||
* Return: 0 on success, -EBUSY after page fault, or page fault error
|
||||
*
|
||||
* This function will be called whenever pmd_none() or pte_none() returns true,
|
||||
* or whenever there is no page directory covering the virtual address range.
|
||||
|
@ -550,7 +512,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
|
|||
|
||||
static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
|
||||
{
|
||||
if (pte_none(pte) || !pte_present(pte))
|
||||
if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
|
||||
return 0;
|
||||
return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
|
||||
range->flags[HMM_PFN_WRITE] :
|
||||
|
@ -788,7 +750,6 @@ again:
|
|||
return hmm_vma_walk_hole_(addr, end, fault,
|
||||
write_fault, walk);
|
||||
|
||||
#ifdef CONFIG_HUGETLB_PAGE
|
||||
pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
|
||||
for (i = 0; i < npages; ++i, ++pfn) {
|
||||
hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
|
||||
|
@ -804,9 +765,6 @@ again:
|
|||
}
|
||||
hmm_vma_walk->last = end;
|
||||
return 0;
|
||||
#else
|
||||
return -EINVAL;
|
||||
#endif
|
||||
}
|
||||
|
||||
split_huge_pud(walk->vma, pudp, addr);
|
||||
|
@ -909,12 +867,14 @@ static void hmm_pfns_clear(struct hmm_range *range,
|
|||
* Track updates to the CPU page table see include/linux/hmm.h
|
||||
*/
|
||||
int hmm_range_register(struct hmm_range *range,
|
||||
struct mm_struct *mm,
|
||||
struct hmm_mirror *mirror,
|
||||
unsigned long start,
|
||||
unsigned long end,
|
||||
unsigned page_shift)
|
||||
{
|
||||
unsigned long mask = ((1UL << page_shift) - 1UL);
|
||||
struct hmm *hmm = mirror->hmm;
|
||||
unsigned long flags;
|
||||
|
||||
range->valid = false;
|
||||
range->hmm = NULL;
|
||||
|
@ -928,28 +888,24 @@ int hmm_range_register(struct hmm_range *range,
|
|||
range->start = start;
|
||||
range->end = end;
|
||||
|
||||
range->hmm = hmm_get_or_create(mm);
|
||||
if (!range->hmm)
|
||||
/* Prevent hmm_release() from running while the range is valid */
|
||||
if (!mmget_not_zero(hmm->mm))
|
||||
return -EFAULT;
|
||||
|
||||
/* Check if hmm_mm_destroy() was call. */
|
||||
if (range->hmm->mm == NULL || range->hmm->dead) {
|
||||
hmm_put(range->hmm);
|
||||
return -EFAULT;
|
||||
}
|
||||
/* Initialize range to track CPU page table updates. */
|
||||
spin_lock_irqsave(&hmm->ranges_lock, flags);
|
||||
|
||||
/* Initialize range to track CPU page table update */
|
||||
mutex_lock(&range->hmm->lock);
|
||||
|
||||
list_add_rcu(&range->list, &range->hmm->ranges);
|
||||
range->hmm = hmm;
|
||||
kref_get(&hmm->kref);
|
||||
list_add(&range->list, &hmm->ranges);
|
||||
|
||||
/*
|
||||
* If there are any concurrent notifiers we have to wait for them for
|
||||
* the range to be valid (see hmm_range_wait_until_valid()).
|
||||
*/
|
||||
if (!range->hmm->notifiers)
|
||||
if (!hmm->notifiers)
|
||||
range->valid = true;
|
||||
mutex_unlock(&range->hmm->lock);
|
||||
spin_unlock_irqrestore(&hmm->ranges_lock, flags);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
@ -964,25 +920,31 @@ EXPORT_SYMBOL(hmm_range_register);
|
|||
*/
|
||||
void hmm_range_unregister(struct hmm_range *range)
|
||||
{
|
||||
/* Sanity check this really should not happen. */
|
||||
if (range->hmm == NULL || range->end <= range->start)
|
||||
return;
|
||||
struct hmm *hmm = range->hmm;
|
||||
unsigned long flags;
|
||||
|
||||
mutex_lock(&range->hmm->lock);
|
||||
list_del_rcu(&range->list);
|
||||
mutex_unlock(&range->hmm->lock);
|
||||
spin_lock_irqsave(&hmm->ranges_lock, flags);
|
||||
list_del_init(&range->list);
|
||||
spin_unlock_irqrestore(&hmm->ranges_lock, flags);
|
||||
|
||||
/* Drop reference taken by hmm_range_register() */
|
||||
mmput(hmm->mm);
|
||||
hmm_put(hmm);
|
||||
|
||||
/*
|
||||
* The range is now invalid and the ref on the hmm is dropped, so
|
||||
* poison the pointer. Leave other fields in place, for the caller's
|
||||
* use.
|
||||
*/
|
||||
range->valid = false;
|
||||
hmm_put(range->hmm);
|
||||
range->hmm = NULL;
|
||||
memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
|
||||
}
|
||||
EXPORT_SYMBOL(hmm_range_unregister);
|
||||
|
||||
/*
|
||||
* hmm_range_snapshot() - snapshot CPU page table for a range
|
||||
* @range: range
|
||||
* Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
|
||||
* Return: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
|
||||
* permission (for instance asking for write and range is read only),
|
||||
* -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
|
||||
* vma or it is illegal to access that range), number of valid pages
|
||||
|
@ -1001,10 +963,7 @@ long hmm_range_snapshot(struct hmm_range *range)
|
|||
struct vm_area_struct *vma;
|
||||
struct mm_walk mm_walk;
|
||||
|
||||
/* Check if hmm_mm_destroy() was call. */
|
||||
if (hmm->mm == NULL || hmm->dead)
|
||||
return -EFAULT;
|
||||
|
||||
lockdep_assert_held(&hmm->mm->mmap_sem);
|
||||
do {
|
||||
/* If range is no longer valid force retry. */
|
||||
if (!range->valid)
|
||||
|
@ -1015,9 +974,8 @@ long hmm_range_snapshot(struct hmm_range *range)
|
|||
return -EFAULT;
|
||||
|
||||
if (is_vm_hugetlb_page(vma)) {
|
||||
struct hstate *h = hstate_vma(vma);
|
||||
|
||||
if (huge_page_shift(h) != range->page_shift &&
|
||||
if (huge_page_shift(hstate_vma(vma)) !=
|
||||
range->page_shift &&
|
||||
range->page_shift != PAGE_SHIFT)
|
||||
return -EINVAL;
|
||||
} else {
|
||||
|
@ -1066,7 +1024,7 @@ EXPORT_SYMBOL(hmm_range_snapshot);
|
|||
* hmm_range_fault() - try to fault some address in a virtual address range
|
||||
* @range: range being faulted
|
||||
* @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
|
||||
* Returns: number of valid pages in range->pfns[] (from range start
|
||||
* Return: number of valid pages in range->pfns[] (from range start
|
||||
* address). This may be zero. If the return value is negative,
|
||||
* then one of the following values may be returned:
|
||||
*
|
||||
|
@ -1100,9 +1058,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
|
|||
struct mm_walk mm_walk;
|
||||
int ret;
|
||||
|
||||
/* Check if hmm_mm_destroy() was call. */
|
||||
if (hmm->mm == NULL || hmm->dead)
|
||||
return -EFAULT;
|
||||
lockdep_assert_held(&hmm->mm->mmap_sem);
|
||||
|
||||
do {
|
||||
/* If range is no longer valid force retry. */
|
||||
|
@ -1184,7 +1140,7 @@ EXPORT_SYMBOL(hmm_range_fault);
|
|||
* @device: device against to dma map page to
|
||||
* @daddrs: dma address of mapped pages
|
||||
* @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
|
||||
* Returns: number of pages mapped on success, -EAGAIN if mmap_sem have been
|
||||
* Return: number of pages mapped on success, -EAGAIN if mmap_sem have been
|
||||
* drop and you need to try again, some other error value otherwise
|
||||
*
|
||||
* Note same usage pattern as hmm_range_fault().
|
||||
|
@ -1272,7 +1228,7 @@ EXPORT_SYMBOL(hmm_range_dma_map);
|
|||
* @device: device against which dma map was done
|
||||
* @daddrs: dma address of mapped pages
|
||||
* @dirty: dirty page if it had the write flag set
|
||||
* Returns: number of page unmapped on success, -EINVAL otherwise
|
||||
* Return: number of page unmapped on success, -EINVAL otherwise
|
||||
*
|
||||
* Note that caller MUST abide by mmu notifier or use HMM mirror and abide
|
||||
* to the sync_cpu_device_pagetables() callback so that it is safe here to
|
||||
|
@ -1328,284 +1284,3 @@ long hmm_range_dma_unmap(struct hmm_range *range,
|
|||
return cpages;
|
||||
}
|
||||
EXPORT_SYMBOL(hmm_range_dma_unmap);
|
||||
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
|
||||
|
||||
|
||||
#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
|
||||
struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
|
||||
unsigned long addr)
|
||||
{
|
||||
struct page *page;
|
||||
|
||||
page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
|
||||
if (!page)
|
||||
return NULL;
|
||||
lock_page(page);
|
||||
return page;
|
||||
}
|
||||
EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
|
||||
|
||||
|
||||
static void hmm_devmem_ref_release(struct percpu_ref *ref)
|
||||
{
|
||||
struct hmm_devmem *devmem;
|
||||
|
||||
devmem = container_of(ref, struct hmm_devmem, ref);
|
||||
complete(&devmem->completion);
|
||||
}
|
||||
|
||||
static void hmm_devmem_ref_exit(struct percpu_ref *ref)
|
||||
{
|
||||
struct hmm_devmem *devmem;
|
||||
|
||||
devmem = container_of(ref, struct hmm_devmem, ref);
|
||||
wait_for_completion(&devmem->completion);
|
||||
percpu_ref_exit(ref);
|
||||
}
|
||||
|
||||
static void hmm_devmem_ref_kill(struct percpu_ref *ref)
|
||||
{
|
||||
percpu_ref_kill(ref);
|
||||
}
|
||||
|
||||
static vm_fault_t hmm_devmem_fault(struct vm_area_struct *vma,
|
||||
unsigned long addr,
|
||||
const struct page *page,
|
||||
unsigned int flags,
|
||||
pmd_t *pmdp)
|
||||
{
|
||||
struct hmm_devmem *devmem = page->pgmap->data;
|
||||
|
||||
return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
|
||||
}
|
||||
|
||||
static void hmm_devmem_free(struct page *page, void *data)
|
||||
{
|
||||
struct hmm_devmem *devmem = data;
|
||||
|
||||
page->mapping = NULL;
|
||||
|
||||
devmem->ops->free(devmem, page);
|
||||
}
|
||||
|
||||
/*
|
||||
* hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
|
||||
*
|
||||
* @ops: memory event device driver callback (see struct hmm_devmem_ops)
|
||||
* @device: device struct to bind the resource too
|
||||
* @size: size in bytes of the device memory to add
|
||||
* Returns: pointer to new hmm_devmem struct ERR_PTR otherwise
|
||||
*
|
||||
* This function first finds an empty range of physical address big enough to
|
||||
* contain the new resource, and then hotplugs it as ZONE_DEVICE memory, which
|
||||
* in turn allocates struct pages. It does not do anything beyond that; all
|
||||
* events affecting the memory will go through the various callbacks provided
|
||||
* by hmm_devmem_ops struct.
|
||||
*
|
||||
* Device driver should call this function during device initialization and
|
||||
* is then responsible of memory management. HMM only provides helpers.
|
||||
*/
|
||||
struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
|
||||
struct device *device,
|
||||
unsigned long size)
|
||||
{
|
||||
struct hmm_devmem *devmem;
|
||||
resource_size_t addr;
|
||||
void *result;
|
||||
int ret;
|
||||
|
||||
dev_pagemap_get_ops();
|
||||
|
||||
devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
|
||||
if (!devmem)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
init_completion(&devmem->completion);
|
||||
devmem->pfn_first = -1UL;
|
||||
devmem->pfn_last = -1UL;
|
||||
devmem->resource = NULL;
|
||||
devmem->device = device;
|
||||
devmem->ops = ops;
|
||||
|
||||
ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
|
||||
0, GFP_KERNEL);
|
||||
if (ret)
|
||||
return ERR_PTR(ret);
|
||||
|
||||
size = ALIGN(size, PA_SECTION_SIZE);
|
||||
addr = min((unsigned long)iomem_resource.end,
|
||||
(1UL << MAX_PHYSMEM_BITS) - 1);
|
||||
addr = addr - size + 1UL;
|
||||
|
||||
/*
|
||||
* FIXME add a new helper to quickly walk resource tree and find free
|
||||
* range
|
||||
*
|
||||
* FIXME what about ioport_resource resource ?
|
||||
*/
|
||||
for (; addr > size && addr >= iomem_resource.start; addr -= size) {
|
||||
ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
|
||||
if (ret != REGION_DISJOINT)
|
||||
continue;
|
||||
|
||||
devmem->resource = devm_request_mem_region(device, addr, size,
|
||||
dev_name(device));
|
||||
if (!devmem->resource)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
break;
|
||||
}
|
||||
if (!devmem->resource)
|
||||
return ERR_PTR(-ERANGE);
|
||||
|
||||
devmem->resource->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
|
||||
devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
|
||||
devmem->pfn_last = devmem->pfn_first +
|
||||
(resource_size(devmem->resource) >> PAGE_SHIFT);
|
||||
devmem->page_fault = hmm_devmem_fault;
|
||||
|
||||
devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
|
||||
devmem->pagemap.res = *devmem->resource;
|
||||
devmem->pagemap.page_free = hmm_devmem_free;
|
||||
devmem->pagemap.altmap_valid = false;
|
||||
devmem->pagemap.ref = &devmem->ref;
|
||||
devmem->pagemap.data = devmem;
|
||||
devmem->pagemap.kill = hmm_devmem_ref_kill;
|
||||
devmem->pagemap.cleanup = hmm_devmem_ref_exit;
|
||||
|
||||
result = devm_memremap_pages(devmem->device, &devmem->pagemap);
|
||||
if (IS_ERR(result))
|
||||
return result;
|
||||
return devmem;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(hmm_devmem_add);
|
||||
|
||||
struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
|
||||
struct device *device,
|
||||
struct resource *res)
|
||||
{
|
||||
struct hmm_devmem *devmem;
|
||||
void *result;
|
||||
int ret;
|
||||
|
||||
if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
|
||||
return ERR_PTR(-EINVAL);
|
||||
|
||||
dev_pagemap_get_ops();
|
||||
|
||||
devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
|
||||
if (!devmem)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
init_completion(&devmem->completion);
|
||||
devmem->pfn_first = -1UL;
|
||||
devmem->pfn_last = -1UL;
|
||||
devmem->resource = res;
|
||||
devmem->device = device;
|
||||
devmem->ops = ops;
|
||||
|
||||
ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
|
||||
0, GFP_KERNEL);
|
||||
if (ret)
|
||||
return ERR_PTR(ret);
|
||||
|
||||
devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
|
||||
devmem->pfn_last = devmem->pfn_first +
|
||||
(resource_size(devmem->resource) >> PAGE_SHIFT);
|
||||
devmem->page_fault = hmm_devmem_fault;
|
||||
|
||||
devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
|
||||
devmem->pagemap.res = *devmem->resource;
|
||||
devmem->pagemap.page_free = hmm_devmem_free;
|
||||
devmem->pagemap.altmap_valid = false;
|
||||
devmem->pagemap.ref = &devmem->ref;
|
||||
devmem->pagemap.data = devmem;
|
||||
devmem->pagemap.kill = hmm_devmem_ref_kill;
|
||||
devmem->pagemap.cleanup = hmm_devmem_ref_exit;
|
||||
|
||||
result = devm_memremap_pages(devmem->device, &devmem->pagemap);
|
||||
if (IS_ERR(result))
|
||||
return result;
|
||||
return devmem;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
|
||||
|
||||
/*
|
||||
* A device driver that wants to handle multiple devices memory through a
|
||||
* single fake device can use hmm_device to do so. This is purely a helper
|
||||
* and it is not needed to make use of any HMM functionality.
|
||||
*/
|
||||
#define HMM_DEVICE_MAX 256
|
||||
|
||||
static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
|
||||
static DEFINE_SPINLOCK(hmm_device_lock);
|
||||
static struct class *hmm_device_class;
|
||||
static dev_t hmm_device_devt;
|
||||
|
||||
static void hmm_device_release(struct device *device)
|
||||
{
|
||||
struct hmm_device *hmm_device;
|
||||
|
||||
hmm_device = container_of(device, struct hmm_device, device);
|
||||
spin_lock(&hmm_device_lock);
|
||||
clear_bit(hmm_device->minor, hmm_device_mask);
|
||||
spin_unlock(&hmm_device_lock);
|
||||
|
||||
kfree(hmm_device);
|
||||
}
|
||||
|
||||
struct hmm_device *hmm_device_new(void *drvdata)
|
||||
{
|
||||
struct hmm_device *hmm_device;
|
||||
|
||||
hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
|
||||
if (!hmm_device)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
spin_lock(&hmm_device_lock);
|
||||
hmm_device->minor = find_first_zero_bit(hmm_device_mask, HMM_DEVICE_MAX);
|
||||
if (hmm_device->minor >= HMM_DEVICE_MAX) {
|
||||
spin_unlock(&hmm_device_lock);
|
||||
kfree(hmm_device);
|
||||
return ERR_PTR(-EBUSY);
|
||||
}
|
||||
set_bit(hmm_device->minor, hmm_device_mask);
|
||||
spin_unlock(&hmm_device_lock);
|
||||
|
||||
dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
|
||||
hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
|
||||
hmm_device->minor);
|
||||
hmm_device->device.release = hmm_device_release;
|
||||
dev_set_drvdata(&hmm_device->device, drvdata);
|
||||
hmm_device->device.class = hmm_device_class;
|
||||
device_initialize(&hmm_device->device);
|
||||
|
||||
return hmm_device;
|
||||
}
|
||||
EXPORT_SYMBOL(hmm_device_new);
|
||||
|
||||
void hmm_device_put(struct hmm_device *hmm_device)
|
||||
{
|
||||
put_device(&hmm_device->device);
|
||||
}
|
||||
EXPORT_SYMBOL(hmm_device_put);
|
||||
|
||||
static int __init hmm_init(void)
|
||||
{
|
||||
int ret;
|
||||
|
||||
ret = alloc_chrdev_region(&hmm_device_devt, 0,
|
||||
HMM_DEVICE_MAX,
|
||||
"hmm_device");
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
hmm_device_class = class_create(THIS_MODULE, "hmm_device");
|
||||
if (IS_ERR(hmm_device_class)) {
|
||||
unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
|
||||
return PTR_ERR(hmm_device_class);
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
device_initcall(hmm_init);
|
||||
#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
|
||||
|
|
|
@ -354,7 +354,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
|
|||
continue;
|
||||
}
|
||||
|
||||
page = _vm_normal_page(vma, addr, ptent, true);
|
||||
page = vm_normal_page(vma, addr, ptent);
|
||||
if (!page)
|
||||
continue;
|
||||
|
||||
|
|
|
@ -4908,7 +4908,7 @@ enum mc_target_type {
|
|||
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
|
||||
unsigned long addr, pte_t ptent)
|
||||
{
|
||||
struct page *page = _vm_normal_page(vma, addr, ptent, true);
|
||||
struct page *page = vm_normal_page(vma, addr, ptent);
|
||||
|
||||
if (!page || !page_mapped(page))
|
||||
return NULL;
|
||||
|
@ -5109,8 +5109,8 @@ out:
|
|||
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
|
||||
* target for charge migration. if @target is not NULL, the entry is stored
|
||||
* in target->ent.
|
||||
* 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
|
||||
* or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
|
||||
* 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE
|
||||
* (so ZONE_DEVICE page and thus not on the lru).
|
||||
* For now we such page is charge like a regular page would be as for all
|
||||
* intent and purposes it is just special memory taking the place of a
|
||||
* regular page.
|
||||
|
@ -5144,8 +5144,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
|
|||
*/
|
||||
if (page->mem_cgroup == mc.from) {
|
||||
ret = MC_TARGET_PAGE;
|
||||
if (is_device_private_page(page) ||
|
||||
is_device_public_page(page))
|
||||
if (is_device_private_page(page))
|
||||
ret = MC_TARGET_DEVICE;
|
||||
if (target)
|
||||
target->page = page;
|
||||
|
@ -5216,8 +5215,8 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
|
|||
if (ptl) {
|
||||
/*
|
||||
* Note their can not be MC_TARGET_DEVICE for now as we do not
|
||||
* support transparent huge page with MEMORY_DEVICE_PUBLIC or
|
||||
* MEMORY_DEVICE_PRIVATE but this might change.
|
||||
* support transparent huge page with MEMORY_DEVICE_PRIVATE but
|
||||
* this might change.
|
||||
*/
|
||||
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
|
||||
mc.precharge += HPAGE_PMD_NR;
|
||||
|
|
|
@ -1177,16 +1177,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
|
|||
goto unlock;
|
||||
}
|
||||
|
||||
switch (pgmap->type) {
|
||||
case MEMORY_DEVICE_PRIVATE:
|
||||
case MEMORY_DEVICE_PUBLIC:
|
||||
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
|
||||
/*
|
||||
* TODO: Handle HMM pages which may need coordination
|
||||
* with device-side memory.
|
||||
*/
|
||||
goto unlock;
|
||||
default:
|
||||
break;
|
||||
}
|
||||
|
||||
/*
|
||||
|
|
49
mm/memory.c
|
@ -571,8 +571,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
|
|||
* PFNMAP mappings in order to support COWable mappings.
|
||||
*
|
||||
*/
|
||||
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
|
||||
pte_t pte, bool with_public_device)
|
||||
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
|
||||
pte_t pte)
|
||||
{
|
||||
unsigned long pfn = pte_pfn(pte);
|
||||
|
||||
|
@ -585,29 +585,6 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
|
|||
return NULL;
|
||||
if (is_zero_pfn(pfn))
|
||||
return NULL;
|
||||
|
||||
/*
|
||||
* Device public pages are special pages (they are ZONE_DEVICE
|
||||
* pages but different from persistent memory). They behave
|
||||
* allmost like normal pages. The difference is that they are
|
||||
* not on the lru and thus should never be involve with any-
|
||||
* thing that involve lru manipulation (mlock, numa balancing,
|
||||
* ...).
|
||||
*
|
||||
* This is why we still want to return NULL for such page from
|
||||
* vm_normal_page() so that we do not have to special case all
|
||||
* call site of vm_normal_page().
|
||||
*/
|
||||
if (likely(pfn <= highest_memmap_pfn)) {
|
||||
struct page *page = pfn_to_page(pfn);
|
||||
|
||||
if (is_device_public_page(page)) {
|
||||
if (with_public_device)
|
||||
return page;
|
||||
return NULL;
|
||||
}
|
||||
}
|
||||
|
||||
if (pte_devmap(pte))
|
||||
return NULL;
|
||||
|
||||
|
@ -797,17 +774,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
|
|||
rss[mm_counter(page)]++;
|
||||
} else if (pte_devmap(pte)) {
|
||||
page = pte_page(pte);
|
||||
|
||||
/*
|
||||
* Cache coherent device memory behave like regular page and
|
||||
* not like persistent memory page. For more informations see
|
||||
* MEMORY_DEVICE_CACHE_COHERENT in memory_hotplug.h
|
||||
*/
|
||||
if (is_device_public_page(page)) {
|
||||
get_page(page);
|
||||
page_dup_rmap(page, false);
|
||||
rss[mm_counter(page)]++;
|
||||
}
|
||||
}
|
||||
|
||||
out_set_pte:
|
||||
|
@ -1063,7 +1029,7 @@ again:
|
|||
if (pte_present(ptent)) {
|
||||
struct page *page;
|
||||
|
||||
page = _vm_normal_page(vma, addr, ptent, true);
|
||||
page = vm_normal_page(vma, addr, ptent);
|
||||
if (unlikely(details) && page) {
|
||||
/*
|
||||
* unmap_shared_mapping_pages() wants to
|
||||
|
@ -2777,13 +2743,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
|
|||
migration_entry_wait(vma->vm_mm, vmf->pmd,
|
||||
vmf->address);
|
||||
} else if (is_device_private_entry(entry)) {
|
||||
/*
|
||||
* For un-addressable device memory we call the pgmap
|
||||
* fault handler callback. The callback must migrate
|
||||
* the page back to some CPU accessible page.
|
||||
*/
|
||||
ret = device_private_entry_fault(vma, vmf->address, entry,
|
||||
vmf->flags, vmf->pmd);
|
||||
vmf->page = device_private_entry_to_page(entry);
|
||||
ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
|
||||
} else if (is_hwpoison_entry(entry)) {
|
||||
ret = VM_FAULT_HWPOISON;
|
||||
} else {
|
||||

@@ -557,10 +557,8 @@ void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 	int sections_to_remove;
 
 	/* In the ZONE_DEVICE case device driver owns the memory region */
-	if (is_dev_zone(zone)) {
-		if (altmap)
-			map_offset = vmem_altmap_offset(altmap);
-	}
+	if (is_dev_zone(zone))
+		map_offset = vmem_altmap_offset(altmap);
 
 	clear_zone_contiguous(zone);
 

@@ -2098,6 +2098,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 out:
 	return page;
 }
+EXPORT_SYMBOL(alloc_pages_vma);
 
 /**
  * alloc_pages_current - Allocate pages.

28 mm/migrate.c
@@ -246,8 +246,6 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			if (is_device_private_page(new)) {
 				entry = make_device_private_entry(new, pte_write(pte));
 				pte = swp_entry_to_pte(entry);
-			} else if (is_device_public_page(new)) {
-				pte = pte_mkdevmap(pte);
 			}
 		}
 
@@ -381,7 +379,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 	 * ZONE_DEVICE pages.
 	 */
 	expected_count += is_device_private_page(page);
-	expected_count += is_device_public_page(page);
 	if (mapping)
 		expected_count += hpage_nr_pages(page) + page_has_private(page);
 
@@ -994,10 +991,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		if (!PageMappingFlags(page))
 			page->mapping = NULL;
 
-		if (unlikely(is_zone_device_page(newpage))) {
-			if (is_device_public_page(newpage))
-				flush_dcache_page(newpage);
-		} else
+		if (likely(!is_zone_device_page(newpage)))
 			flush_dcache_page(newpage);
 
 	}
@@ -2265,7 +2259,7 @@ again:
 				pfn = 0;
 				goto next;
 			}
-			page = _vm_normal_page(migrate->vma, addr, pte, true);
+			page = vm_normal_page(migrate->vma, addr, pte);
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
@@ -2406,16 +2400,7 @@ static bool migrate_vma_check_page(struct page *page)
 		 * FIXME proper solution is to rework migration_entry_wait() so
 		 * it does not need to take a reference on page.
 		 */
-		if (is_device_private_page(page))
-			return true;
-
-		/*
-		 * Only allow device public page to be migrated and account for
-		 * the extra reference count imply by ZONE_DEVICE pages.
-		 */
-		if (!is_device_public_page(page))
-			return false;
-		extra++;
+		return is_device_private_page(page);
 	}
 
 	/* For file back page */
@@ -2665,11 +2650,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 
 			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
 			entry = swp_entry_to_pte(swp_entry);
-		} else if (is_device_public_page(page)) {
-			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
-			if (vma->vm_flags & VM_WRITE)
-				entry = pte_mkwrite(pte_mkdirty(entry));
-			entry = pte_mkdevmap(entry);
 		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
@@ -2789,7 +2769,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 					continue;
 				}
-			} else if (!is_device_public_page(newpage)) {
+			} else {
 				/*
 				 * Other types of ZONE_DEVICE page are not
 				 * supported.

@@ -5925,6 +5925,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 {
 	unsigned long pfn, end_pfn = start_pfn + size;
 	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -5937,9 +5938,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	 * of the pages reserved for the memmap, so we can just jump to
 	 * the end of that region and start processing the device pages.
 	 */
-	if (pgmap->altmap_valid) {
-		struct vmem_altmap *altmap = &pgmap->altmap;
-
+	if (altmap) {
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 		size = end_pfn - start_pfn;
 	}
@@ -5959,12 +5958,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		__SetPageReserved(page);
 
 		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer and hmm_data. It is a bug if a ZONE_DEVICE
-		 * page is ever freed or placed on a driver-private list.
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+		 * and zone_device_data. It is a bug if a ZONE_DEVICE page is
+		 * ever freed or placed on a driver-private list.
 		 */
 		page->pgmap = pgmap;
-		page->hmm_data = 0;
+		page->zone_device_data = NULL;
 
 		/*
 		 * Mark the block movable so that blocks are reserved for

13 mm/swap.c
@@ -740,15 +740,20 @@ void release_pages(struct page **pages, int nr)
 		if (is_huge_zero_page(page))
 			continue;
 
-		/* Device public page can not be huge page */
-		if (is_device_public_page(page)) {
+		if (is_zone_device_page(page)) {
 			if (locked_pgdat) {
 				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_devmap_managed_page(page);
-			continue;
+			/*
+			 * ZONE_DEVICE pages that return 'false' from
+			 * put_devmap_managed_page() do not require special
+			 * processing, and instead, expect a call to
+			 * put_page_testzero().
+			 */
+			if (put_devmap_managed_page(page))
+				continue;
 		}
 
 		page = compound_head(page);

@@ -100,25 +100,60 @@ static void nfit_test_kill(void *_pgmap)
 {
 	struct dev_pagemap *pgmap = _pgmap;
 
-	WARN_ON(!pgmap || !pgmap->ref || !pgmap->kill || !pgmap->cleanup);
-	pgmap->kill(pgmap->ref);
-	pgmap->cleanup(pgmap->ref);
+	WARN_ON(!pgmap || !pgmap->ref);
+
+	if (pgmap->ops && pgmap->ops->kill)
+		pgmap->ops->kill(pgmap);
+	else
+		percpu_ref_kill(pgmap->ref);
+
+	if (pgmap->ops && pgmap->ops->cleanup) {
+		pgmap->ops->cleanup(pgmap);
+	} else {
+		wait_for_completion(&pgmap->done);
+		percpu_ref_exit(pgmap->ref);
+	}
 }
 
+static void dev_pagemap_percpu_release(struct percpu_ref *ref)
+{
+	struct dev_pagemap *pgmap =
+		container_of(ref, struct dev_pagemap, internal_ref);
+
+	complete(&pgmap->done);
+}
+
 void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
+	int error;
 	resource_size_t offset = pgmap->res.start;
 	struct nfit_test_resource *nfit_res = get_nfit_res(offset);
 
-	if (nfit_res) {
-		int rc;
+	if (!nfit_res)
+		return devm_memremap_pages(dev, pgmap);
 
-		rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap);
-		if (rc)
-			return ERR_PTR(rc);
-		return nfit_res->buf + offset - nfit_res->res.start;
+	pgmap->dev = dev;
+	if (!pgmap->ref) {
+		if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup))
+			return ERR_PTR(-EINVAL);
+
+		init_completion(&pgmap->done);
+		error = percpu_ref_init(&pgmap->internal_ref,
+				dev_pagemap_percpu_release, 0, GFP_KERNEL);
+		if (error)
+			return ERR_PTR(error);
+		pgmap->ref = &pgmap->internal_ref;
+	} else {
+		if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) {
+			WARN(1, "Missing reference count teardown definition\n");
+			return ERR_PTR(-EINVAL);
+		}
 	}
-	return devm_memremap_pages(dev, pgmap);
+
+	error = devm_add_action_or_reset(dev, nfit_test_kill, pgmap);
+	if (error)
+		return ERR_PTR(error);
+	return nfit_res->buf + offset - nfit_res->res.start;
 }
 EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
 