1/ Device DAX for persistent memory:
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
without need of an intervening file system. Device DAX is strict,
precise and predictable. Specifically this interface:
a) Guarantees fault granularity with respect to a given page size
(pte, pmd, or pud) set at configuration time.
b) Enforces deterministic behavior by being strict about what fault
scenarios are supported.
Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance/feature differentiated
memory ranges.
2/ Support for the HPE DSM (device specific method) command formats.
This enables management of these first generation devices until a
unified DSM specification materializes.
3/ Further ACPI 6.1 compliance with support for the common dimm
identifier format.
4/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQhdeAAoJEB7SkWpmfYgCYP8P/RAgHkroL5lUKKU45TQUBKcY
diC9POeNSccme4tIRIQCGQUZ7+7mKM5ECv2ulF4xYOHvFBCcd/8OF6xKAXs48r3v
oguYhvX1YvIkBc9FUfBQbR1IsCOJ7uWp/UYiYCIQEXS5tS9Jv545j3ASqDt9xWoV
TWlceZn3yWSbASiV9qZ2eXhEkk75pg4yara++rsm2/7rs/TTXn5EIjBs+57BtAo+
6utI4fTy0CQvBYwVzam3m7y9dt2Z2jWXL4hgmT7pkvJ7HDoctVly0P9+bknJPUAo
g+NugKgTGeiqH5GYp5CTZ9KvL91sDF4q00pfinITVdFl0E3VE293cIHlAzSQBm5/
w58xxaRV958ZvpH7EaBmYQG82QDi/eFNqeHqVGn0xAM6MlaqO7avUMQp2lRPYMCJ
u1z/NloR5yo+sffHxsn5Luiq9KqOf6zk33PuxEkKbN74OayCSPn/SeVCO7rQR0B6
yPMJTTcTiCLnId1kOWAPaEmuK2U3BW/+ogg7hKgeCQSysuy5n6Ok5a2vEx/gJRAm
v9yF68RmIWumpHr+QB0TmB8mVbD5SY+xWTm3CqJb9MipuFIOF7AVsPyTgucBvE7s
v+i5F6MDO6tcVfiDT4AiZEt6D2TM5RbtckkUEX3ZTD6j7CGuR5D8bH0HNRrghrYk
KT1lAk6tjWBOGAHc5Ji7
=Y3Xv
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The bulk of this update was stabilized before the merge window and
appeared in -next. The "device dax" implementation was revised this
week in response to review feedback, and to address failures detected
by the recently expanded ndctl unit test suite.
Not included in this pull request are two dax topic branches (dax
error handling, and dax radix-tree locking). These topics were
deferred to get a few more days of -next integration testing, and to
coordinate a branch baseline with Ted and the ext4 tree. Vishal and
Ross will send the error handling and locking topics respectively in
the next few days.
This branch has received a positive build result from the kbuild robot
across 226 configs.
Summary:
- Device DAX for persistent memory: Device DAX is the device-centric
analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
ranges to be allocated and mapped without need of an intervening
file system. Device DAX is strict, precise and predictable.
Specifically this interface:
a) Guarantees fault granularity with respect to a given page size
(pte, pmd, or pud) set at configuration time.
b) Enforces deterministic behavior by being strict about what
fault scenarios are supported.
Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance/feature
differentiated memory ranges.
- Support for the HPE DSM (device specific method) command formats.
This enables management of these first generation devices until a
unified DSM specification materializes.
- Further ACPI 6.1 compliance with support for the common dimm
identifier format.
- Various fixes and cleanups across the subsystem"
* tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
libnvdimm, dax: fix deletion
libnvdimm, dax: fix alignment validation
libnvdimm, dax: autodetect support
libnvdimm: release ida resources
Revert "block: enable dax for raw block devices"
/dev/dax, core: file operations and dax-mmap
/dev/dax, pmem: direct access to persistent memory
libnvdimm: stop requiring a driver ->remove() method
libnvdimm, dax: record the specified alignment of a dax-device instance
libnvdimm, dax: reserve space to store labels for device-dax
libnvdimm, dax: introduce device-dax infrastructure
nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
tools/testing/nvdimm: ND_CMD_CALL support
nfit: disable vendor specific commands
nfit: export subsystem ids as attributes
nfit: fix format interface code byte order per ACPI6.1
nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
nfit, libnvdimm: clarify "commands" vs "_DSMs"
libnvdimm: increase max envelope size for ioctl
acpi/nfit: Add sysfs "id" for NVDIMM ID
...
The "Device DAX" core enables dax mappings of performance / feature
differentiated memory. An open mapping or file handle keeps the backing
struct device live, but new mappings are only possible while the device
is enabled. Faults are handled under rcu_read_lock to synchronize
with the enabled state of the device.
Similar to the filesystem-dax case the backing memory may optionally
have struct page entries. However, unlike fs-dax there is no support
for private mappings, or mappings that are not backed by media (see
use of zero-page in fs-dax).
Mappings are always guaranteed to match the alignment of the dax_region.
If the dax_region is configured to have a 2MB alignment, all mappings
are guaranteed to be backed by a pmd entry. Contrast this determinism
with the fs-dax case where pmd mappings are opportunistic. If userspace
attempts to force a misaligned mapping, the driver will fail the mmap
attempt. See dax_dev_check_vma() for other scenarios that are rejected,
like MAP_PRIVATE mappings.
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If a large value is written to scan_sleep_millisecs, for example, that
period must lapse before khugepaged will wake up for periodic
collapsing.
If this value is tuned to 1 day, for example, and then re-tuned to its
default 10s, khugepaged will still wait for a day before scanning again.
This patch causes khugepaged to wakeup immediately when the value is
changed and then sleep until that value is rewritten or the new value
lapses.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment is partly wrong, this improves it by including the case of
split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD
is set.
Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The implementation of mk_huge_pmd looks verbose, it could be just
simplified to one line code.
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
it was in the old vma, alignment and size permitting.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many developers already know that field for reference count of the
struct page is _count and atomic type. They would try to handle it
directly and this could break the purpose of page reference count
tracepoint. To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code. After that, developer
who need to handle reference count will find that field should not be
accessed directly.
[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Manish Chopra <manish.chopra@qlogic.com>
Cc: Yuval Mintz <yuval.mintz@qlogic.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.
This also provide at practical zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.
Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.
Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.
Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.
Mike Marciniszyn said:
: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698cc ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data. The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (<= 4 iterations) and then gets a
: compare error.
:
: Tracing indicates the kernel logical addresses associated with the individual
: pages at registration ARE correct , the data in the "RDMA read response only"
: packets ARE correct.
:
: The "corruption" occurs when the packet crosse two pages that are not physically
: contiguous. The second page reads back as zero in the program.
:
: It looks like the user VA at the point of the compare error no longer points to
: the same physical address as was registered.
:
: This patch totally resolves the issue!
Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Josh Collier <josh.d.collier@intel.com>
Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
Cc: <stable@vger.kernel.org> [4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
split_huge_pages doesn't support get method at all, so the read
permission sounds confusing, change the permission to write only.
And, add "\n" to the output of set method to make it more readable.
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Khugepaged detects own VMAs by checking vm_file and vm_ops but this way
it cannot distinguish private /dev/zero mappings from other special
mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap.
This fixes false-positive VM_BUG_ON and prevents installing THP where
they are not expected.
Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
Fixes: 78f11a2557 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups")
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea has found[1] a race condition on MMU-gather based TLB flush vs
split_huge_page() or shrinker which frees huge zero under us (patch 1/2
and 2/2 respectively).
With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
page pinned until flush is complete and the pin prevents the page from
being split under us.
We still need patch 2/2. This is simplified version of Andrea's patch.
We don't need fancy encoding.
[1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights:
- Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras
- Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V
- Add POWER9 cputable entry from Michael Neuling
- FPU/Altivec/VSX save/restore optimisations from Cyril Bur
- Add support for new ftrace ABI on ppc64le from Torsten Duwe
Various cleanups & minor fixes from:
- Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril
Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey,
Sukadev Bhattiprolu, Suraj Jitindar Singh.
General:
- atomics: Allow architectures to define their own __atomic_op_* helpers from
Boqun Feng
- Implement atomic{, 64}_*_return_* variants and acquire/release/relaxed
variants for (cmp)xchg from Boqun Feng
- Add powernv_defconfig from Jeremy Kerr
- Fix BUG_ON() reporting in real mode from Balbir Singh
- Add xmon command to dump OPAL msglog from Andrew Donnellan
- Add xmon command to dump process/task similar to ps(1) from Douglas Miller
- Clean up memory hotplug failure paths from David Gibson
pci/eeh:
- Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei
Yang.
- EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
- PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
- PCI: Add pcibios_bus_add_device() weak function from Wei Yang
- MAINTAINERS: Update EEH details and maintainership from Russell Currey
cxl:
- Support added to the CXL driver for running on both bare-metal and
hypervisor systems, from Christophe Lombard and Frederic Barrat.
- Ignore probes for virtual afu pci devices from Vaibhav Jain
perf:
- Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu
- hv-24x7: Fix usage with chip events, display change in counter values,
display domain indices in sysfs, eliminate domain suffix in event names,
from Sukadev Bhattiprolu
Freescale:
- Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum
optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and
other dt bits, and minor fixes/cleanup."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW69OrAAoJEFHr6jzI4aWAe5EQAJw/hE6WBQc6a7Tj70AnXOqR
qk/m5pZjuTwQxfBteIvHR1pE5eXdlvtAjcD254LVkFkAbIn19W/h2k0VX/nlee7P
n/VRHRifjtGmukqHrPYJJ7ua9mNlY7pxh3leGSixBFASnSWqMxNNNziNQtSTcuCs
TjHiw6NkZ/kzeunA4bAfE4yHVUZjmL74oiS9JbLyaVHqoW4fqWLlh26AKo2yYMZI
qPicBBG4HBi3FGvoexnKxlJNdcV4HO7LzDjJmCSfUKYCJi+Pw19T5qmhso0q0qVz
vHg/A8HNeG4Hn83pNVmLeQSAIQRZ3DvTtcLgbjPo+TVwm/hzrRRBWipTeOVbkLW8
2bcOXT4t7LWUq15EAJ1LYgYZGzcLrfRfUeOcuQ1TWd3+PcfY9pE7FmizsxAAfaVe
E9j9mpz4XnIqBtWkFHneTIHkQ5OWptyKuZJEaYH0nut4VsP0k8NarkseafGqBPu7
5eG83gbiQbCVixfOgblV9eocJ29JcwpjPAY4CZSGJimShg909FV7WRgZgJkKWrbK
dBRco8Jcp4VglGfo2qymv7Uj4KwQoypBREOhiKUvrAsVlDxPfx+bcskhjGu9xGDC
xs/+nme0/lKa/wg5K4C3mQ1GAlkMWHI0ojhJjsyODbetup5UbkEu03wjAaTdO9dT
Y6ptGm0rYAJluPNlziFj
=qkAt
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"This was delayed a day or two by some build-breakage on old toolchains
which we've now fixed.
There's two PCI commits both acked by Bjorn.
There's one commit to mm/hugepage.c which is (co)authored by Kirill.
Highlights:
- Restructure Linux PTE on Book3S/64 to Radix format from Paul
Mackerras
- Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
Kumar K.V
- Add POWER9 cputable entry from Michael Neuling
- FPU/Altivec/VSX save/restore optimisations from Cyril Bur
- Add support for new ftrace ABI on ppc64le from Torsten Duwe
Various cleanups & minor fixes from:
- Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.
General:
- atomics: Allow architectures to define their own __atomic_op_*
helpers from Boqun Feng
- Implement atomic{, 64}_*_return_* variants and acquire/release/
relaxed variants for (cmp)xchg from Boqun Feng
- Add powernv_defconfig from Jeremy Kerr
- Fix BUG_ON() reporting in real mode from Balbir Singh
- Add xmon command to dump OPAL msglog from Andrew Donnellan
- Add xmon command to dump process/task similar to ps(1) from Douglas
Miller
- Clean up memory hotplug failure paths from David Gibson
pci/eeh:
- Redesign SR-IOV on PowerNV to give absolute isolation between VFs
from Wei Yang.
- EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
- PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
- PCI: Add pcibios_bus_add_device() weak function from Wei Yang
- MAINTAINERS: Update EEH details and maintainership from Russell
Currey
cxl:
- Support added to the CXL driver for running on both bare-metal and
hypervisor systems, from Christophe Lombard and Frederic Barrat.
- Ignore probes for virtual afu pci devices from Vaibhav Jain
perf:
- Export Power8 generic and cache events to sysfs from Sukadev
Bhattiprolu
- hv-24x7: Fix usage with chip events, display change in counter
values, display domain indices in sysfs, eliminate domain suffix in
event names, from Sukadev Bhattiprolu
Freescale:
- Updates from Scott: "Highlights include 8xx optimizations, 32-bit
checksum optimizations, 86xx consolidation, e5500/e6500 cpu
hotplug, more fman and other dt bits, and minor fixes/cleanup"
* tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
powerpc: Fix unrecoverable SLB miss during restore_math()
powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
powerpc/rcpm: Fix build break when SMP=n
powerpc/book3e-64: Use hardcoded mttmr opcode
powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
powerpc/T104xRDB: add tdm riser card node to device tree
powerpc32: PAGE_EXEC required for inittext
powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
powerpc/86xx: Introduce and use common dtsi
powerpc/86xx: Update device tree
powerpc/86xx: Move dts files to fsl directory
powerpc/86xx: Switch to kconfig fragments approach
powerpc/86xx: Update defconfigs
powerpc/86xx: Consolidate common platform code
powerpc32: Remove one insn in mulhdu
powerpc32: small optimisation in flush_icache_range()
powerpc: Simplify test in __dma_sync()
powerpc32: move xxxxx_dcache_range() functions inline
powerpc32: Remove clear_pages() and define clear_page() inline
...
split_huge_pmd() tries to munlock page with munlock_vma_page(). That
requires the page to locked.
If the is locked by caller, we would get a deadlock:
Unable to find swap-space signature
INFO: task trinity-c85:1907 blocked for more than 120 seconds.
Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
trinity-c85 D ffff88084d997608 0 1907 309 0x00000000
Call Trace:
schedule+0x9f/0x1c0
schedule_timeout+0x48e/0x600
io_schedule_timeout+0x1c3/0x390
bit_wait_io+0x29/0xd0
__wait_on_bit_lock+0x94/0x140
__lock_page+0x1d4/0x280
__split_huge_pmd+0x5a8/0x10f0
split_huge_pmd_address+0x1d9/0x230
try_to_unmap_one+0x540/0xc70
rmap_walk_anon+0x284/0x810
rmap_walk_locked+0x11e/0x190
try_to_unmap+0x1b1/0x4b0
split_huge_page_to_list+0x49d/0x18a0
follow_page_mask+0xa36/0xea0
SyS_move_pages+0xaf3/0x1570
entry_SYSCALL_64_fastpath+0x12/0x6b
2 locks held by trinity-c85/1907:
#0: (&mm->mmap_sem){++++++}, at: SyS_move_pages+0x933/0x1570
#1: (&anon_vma->rwsem){++++..}, at: split_huge_page_to_list+0x402/0x18a0
I don't think the deadlock is triggerable without split_huge_page()
simplifilcation patchset.
But munlock_vma_page() here is wrong: we want to munlock the page
unconditionally, no need in rmap lookup, that munlock_vma_page() does.
Let's use clear_page_mlock() instead. It can be called under ptl.
Fixes: e90309c9f7 ("thp: allow mlocked THP again")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
freeze_page() and unfreeze_page() helpers evolved in rather complex
beasts. It would be nice to cut complexity of this code.
This patch rewrites freeze_page() using standard try_to_unmap().
unfreeze_page() is rewritten with remove_migration_ptes().
The result is much simpler.
But the new variant is somewhat slower for PTE-mapped THPs. Current
helpers iterates over VMAs the compound page is mapped to, and then over
ptes within this VMA. New helpers iterates over small page, then over
VMA the small page mapped to, and only then find relevant pte.
We have short cut for PMD-mapped THP: we directly install migration
entries on PMD split.
I don't think the slowdown is critical, considering how much simpler
result is and that split_huge_page() is quite rare nowadays. It only
happens due memory pressure or migration.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for two ttu_flags:
- TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
unmap page;
- TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
Also, change rwc->done to !page_mapcount() instead of !page_mapped().
try_to_unmap() works on pte level, so we are really interested in the
mappedness of this small page rather than of the compound page it's a
part of.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel style prefers a single string over split strings when the string is
'user-visible'.
Miscellanea:
- Add a missing newline
- Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org> [percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count. Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it. Then, it is hard to find actual
reason of CMA allocation failure. CMA allocation should be guaranteed
to succeed so finding offending place is really important.
In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function. This is preparation step to
add tracepoint to each page reference manipulation function. With this
facility, we can easily find reason of CMA allocation failure. There is
no functional change in this patch.
In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.
First, this patch defines THP defrag as follows;
madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.
Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really cares are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.
This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.
After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.
The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this
usemem
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)
For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases. Similar,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
User 10357.65 10438.33
System 3988.88 3543.94
Elapsed 2203.01 1634.41
Which is substantial. Now, the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 128458477 278352931
Major Faults 2174976 225
Swap Ins 16904701 0
Swap Outs 17359627 0
Allocation stalls 43611 0
DMA allocs 0 0
DMA32 allocs 19832646 19448017
Normal allocs 614488453 580941839
Movable allocs 0 0
Direct pages scanned 24163800 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 20691346 0
Compaction stalls 42263 0
Compaction success 938 0
Compaction failures 41325 0
This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.
I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used
thpscale Fault Latencies
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)
The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 37429143 47564000
Major Faults 1916 1558
Swap Ins 1466 1079
Swap Outs 2936863 149626
Allocation stalls 62510 3
DMA allocs 0 0
DMA32 allocs 6566458 6401314
Normal allocs 216361697 216538171
Movable allocs 0 0
Direct pages scanned 25977580 17998
Kswapd pages scanned 0 3638931
Kswapd pages reclaimed 0 207236
Direct pages reclaimed 8833714 88
Compaction stalls 103349 5
Compaction success 270 4
Compaction failures 103079 1
Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.
I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available
stutter
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)
This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
User 67.50 0.99
System 1327.88 91.30
Elapsed 2079.00 2128.98
And once again we look at the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 335241922 1314582827
Major Faults 715 819
Swap Ins 0 0
Swap Outs 0 0
Allocation stalls 532723 0
DMA allocs 0 0
DMA32 allocs 1822364341 1177950222
Normal allocs 1815640808 1517844854
Movable allocs 0 0
Direct pages scanned 21892772 0
Kswapd pages scanned 20015890 41879484
Kswapd pages reclaimed 19961986 41822072
Direct pages reclaimed 21892741 0
Compaction stalls 1065755 0
Compaction success 514 0
Compaction failures 1065241 0
Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.
THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Count how many times we put a THP in split queue. Currently, it happens
on partial unmap of a THP.
Rapidly growing value can indicate that an application behaves
unfriendly wrt THP: often fault in huge page and then unmap part of it.
This leads to unnecessary memory fragmentation and the application may
require tuning.
The event also can help with debugging kernel [mis-]behaviour.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After one of bugfixes to freeze_page(), we don't have freezed pages in
rmap, therefore mapcount of all subpages of freezed THP is zero. And we
have assert for that.
Let's drop code which deal with non-zero mapcount of subpages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With next generation power processor, we are having a new mmu model
[1] that require us to maintain a different linux page table format.
Inorder to support both current and future ppc64 systems with a single
kernel we need to make sure kernel can select between different page
table format at runtime. With the new MMU (radix MMU) added, we will
have two different pmd hugepage size 16MB for hash model and 2MB for
Radix model. Hence make HPAGE_PMD related values as a variable.
Actual conversion of HPAGE_PMD to a variable for ppc64 happens in a
followup patch.
[1] http://ibm.biz/power-isa3 (Needs registration).
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- eeh: Fix partial hotplug criterion from Gavin Shan
- mm: Clear the invalid slot information correctly from Aneesh Kumar K.V
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWzXquAAoJEFHr6jzI4aWADHsP/2lbwqz/vS3Ep4zlySHNvStL
/DrRN2TN35THZ59FPRxgEfeqPxTCXtbpD6zEXwD0gf6m39I2zArhaQMOHXMtVPvV
p0nAtwR0PX/PxlQTJDpHlg074vVAD7s3iuvad6oNQObLcXhoZ7wYtbStZ9Ithm4R
YfqZTelzsw+GfMuTYnvAQf5aoRYztUpy7OheaJbbDmSZgMFwF896ZPJnaG9rAOPE
xcSsRaSfHiUR2NE2ua1K5yya+1ilZqrZhib7QxXgzGuxoVa2AAiPR7Hpx2kX1Wm+
z0DqPXISzRbVf9zyLgWD3TpJ4OMHI/CYVW+t/Gx/yWCMfNcfavUrh0vPdHRVEPZu
zxmIUoI6yv7jQ6bcfdzR5s0Mr5pYWlUj5MZg2r8aGqloYcLPk5DiENg+c0QmKI05
kQPCBoQz2ezzJWAt1BYshkc+mlimv3ODaNWFP34Nc6kcDaSO6a0rhVOecvKuR6dv
UBNpeh5np1rKq1wX0ri0yAmnm//yXqe+bK0I8Ctipi0++e73sVJGzfFdVvXwEhhW
h+v1BkdgW8WK/xlH+JCPiXd5dfXrUeFI0D65Kgpb7IbFc9hcXDmp2Dv7+8zx/Wcl
L2NpuucSDxi+LHkE10QiypgLWSKjn9OSi8PLocqABNXG8uHxIp54jRfyViBNALXF
XlPveqTgpt7On3aa0qVh
=bk3U
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-4' into next
Pull in our current fixes from 4.5, in particular the "Fix Multi hit
ERAT" bug is causing folks some grief when testing next.
Sebastian Ott and Gerald Schaefer reported random crashes on s390.
It was bisected to my THP refcounting patchset.
The problem is that pmdp_invalidated() called with wrong virtual
address. It got offset up by HPAGE_PMD_SIZE by loop over ptes.
The solution is to introduce new variable to be used in loop and don't
touch 'haddr'.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-and-tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reported-and-tested-by Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix build error on 32-bit with checkpoint restart from Aneesh Kumar
- Fix dedotify for binutils >= 2.26 from Andreas Schwab
- Don't trace hcalls on offline CPUs from Denis Kirjanov
- eeh: Fix stale cached primary bus from Gavin Shan
- eeh: Fix stale PE primary bus from Gavin Shan
- mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V
- ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWx8l5AAoJEFHr6jzI4aWAeY0P/AomeQCRieoBMKJi36WX4+gU
Cm1iBgM593VEM/KFsYtedm5+4QaCmPE+1tVm4/u0wbLeEQ8TqNLSZLniB9USE0hb
9655gGQyFE95BZa8WfbqHOI7+BK+TkUOWGY0CfyqPVrknzSN2MCDHjUaNo1wge6l
zmIYIkKhaQAinFSFovOdjQ63rYdk6CxsfgbP1Gl2aX0cmzWW1n07AvZLqNmLFJ+4
L3uBXPcrEKY/nfkRi+FutoTb86ggt9J9dqCfJHHfWKn60qwhpKwiva84k3jI/BOu
yBTFeNzlobXt0ceDSWx1ITXzKmJQokWGC5+Lo+0mDb4veAbhLgHlXdx7NUcZIB6+
YGYGSOkeKCnbnInIOGLz45LlevJFviI94y0YY4tt++Csq/IjnBhDeTkGx7zcqRLG
v5hl7AhykHd3Me5iRuyRRVoVyk6+318OZW450Oxxj7EtkzpSeLfHCMKxk5w1Nxuk
tenWQeApdkTVr91m5VJNFOsrmtFDLJv51C8duiUFWzc195ejSMYDR86K+qBeaebs
39obXHVYRnCrn9TzODIR9SnEd7dHImekQ4a3G3F54mLJ4qqUN089TDqBGY2GNuT8
j8QVBttp3SWuZSvtvDJLxoFt2QTKxcuiMQ4FX/OAS4qWRjSR8v2WTCyBZt68l7er
kpUnIelJSuIDVLdNuFlf
=7Yzi
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix build error on 32-bit with checkpoint restart from Aneesh Kumar
- Fix dedotify for binutils >= 2.26 from Andreas Schwab
- Don't trace hcalls on offline CPUs from Denis Kirjanov
- eeh: Fix stale cached primary bus from Gavin Shan
- eeh: Fix stale PE primary bus from Gavin Shan
- mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V
- ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy
* tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/ioda: Set "read" permission when "write" is set
powerpc/mm: Fix Multi hit ERAT cause by recent THP update
powerpc/powernv: Fix stale PE primary bus
powerpc/eeh: Fix stale cached primary bus
powerpc/pseries: Don't trace hcalls on offline CPUs
powerpc: Fix dedotify for binutils >= 2.26
powerpc/book3s_32: Fix build error with checkpoint restart
DAX doesn't deposit pgtables when it maps huge pages: nothing to
withdraw. It can lead to crash.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With ppc64 we use the deposited pgtable_t to store the hash pte slot
information. We should not withdraw the deposited pgtable_t without
marking the pmd none. This ensure that low level hash fault handling
will skip this huge pte and we will handle them at upper levels.
Recent change to pmd splitting changed the above in order to handle the
race between pmd split and exit_mmap. The race is explained below.
Consider following race:
CPU0 CPU1
shrink_page_list()
add_to_swap()
split_huge_page_to_list()
__split_huge_pmd_locked()
pmdp_huge_clear_flush_notify()
// pmd_none() == true
exit_mmap()
unmap_vmas()
zap_pmd_range()
// no action on pmd since pmd_none() == true
pmd_populate()
As result the THP will not be freed. The leak is detected by check_mm():
BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
The above required us to not mark pmd none during a pmd split.
The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
level fault handling code skip this pte. At higher level we do take ptl
lock. That should serialze us against the pmd split. Once the lock is
acquired we do check the pmd again using pmd_same. That should always
return false for us and hence we should retry the access. We do the
pmd_same check in all case after taking plt with
THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
huge_pmd_set_accessed)
Also make sure we wait for irq disable section in other cpus to finish
before flipping a huge pte entry with a regular pmd entry. Code paths
like find_linux_pte_or_hugepte depend on irq disable to get
a stable pte_t pointer. A parallel thp split need to make sure we
don't convert a pmd pte to a regular pmd entry without waiting for the
irq disable section to finish.
Fixes: eef1b3ba05 ("thp: implement split_huge_pmd()")
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We need to iterate over split_queue, not local empty list to get
anything split from the shrinker.
Fixes: e3ae19535c ("thp: limit number of object to scan on deferred_split_scan()")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We allocate a pgtable but do not attach it to anything if the PMD is in
a DAX VMA, causing it to leak.
We certainly try to not free pgtables associated with the huge zero page
if the zero page is in a DAX VMA, so I think this is the right solution.
This needs to be properly audited.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we have a lot of pages in queue to be split, deferred_split_scan()
can spend unreasonable amount of time under spinlock with disabled
interrupts.
Let's cap number of pages to split on scan by sc->nr_to_scan.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I've got meaning of shrinker::count_objects() wrong: it should return
number of potentially freeable objects, which is not necessary correlate
with freeable memory.
Returning 256 per THP in queue is not reasonable:
shrinker::scan_objects() never called with nr_to_scan > 128 in my setup.
Let's return 1 per THP and correct scan_object accordingly.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
ptl doesn't make much sense in this case.
Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A newly added tracepoint in the hugepage code uses a variable in the
error handling that is not initialized at that point:
include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]
The result is relatively harmless, as the trace data will in rare
cases contain incorrect data.
This works around the problem by adding an explicit initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 7d2eba0557 ("mm: add tracepoint for scanning pages")
Reviewed-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b8d3c4c300 ("mm/huge_memory.c: don't split THP page when
MADV_FREE syscall is called") introduced this new function, but got the
error handling for when pmd_trans_huge_lock() fails wrong. In the
failure case, the lock has not been taken, and we should not unlock on
the way out.
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax-huge-page mapping while it uses some thp helpers is ultimately not
a transparent huge page. The distinction is especially important in the
get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge() which have slightly different
semantics.
Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.
[kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to this change DAX PMD mappings that were made read-only were
never able to be made writable again. This is because the code in
insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
these calls if the PMD already existed in the page table.
Instead, if we are doing a write always mark the PMD entry as dirty and
writeable. Without this code we can get into a condition where we mark
the PMD as read-only, and then on a subsequent write fault we get into
an infinite loop of PMD faults where we try unsuccessfully to make the
PMD writeable.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sasha Levin has reported KASAN out-of-bounds bug[1]. It points to "if
(!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.
The cause is that split_huge_page() doesn't handle THP correctly if it's
not allingned to PMD boundary. It can happen after mremap().
Test-case (not always triggers the bug):
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#define MB (1024UL*1024)
#define SIZE (2*MB)
#define BASE ((void *)0x400000000000)
int main()
{
char *p;
p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
-1, 0);
if (p == MAP_FAILED)
perror("mmap"), exit(1);
p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
BASE + SIZE + 8192);
if (p == MAP_FAILED)
perror("mremap"), exit(1);
system("echo 1 > /sys/kernel/debug/split_huge_pages");
return 0;
}
The patch fixes freeze and unfreeze paths to handle page table boundary
crossing.
It also makes mapcount vs count check in split_huge_page_to_list()
stricter:
- after freeze we don't expect any subpage mapped as we remove them
from rmap when setting up migration entries;
- count must be 1, meaning only caller has reference to the page;
[1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need to split THP page when MADV_FREE syscall is called if
[start, len] is aligned with THP size. The split could be done when VM
decide to free it in reclaim path if memory pressure is heavy. With
that, we could avoid unnecessary THP split.
For the feature, this patch changes pte dirtness marking logic of THP.
Now, it marks every ptes of pages dirty unconditionally in splitting,
which makes MADV_FREE void. So, instead, this patch propagates pmd
dirtiness to all pages via PG_dirty and restores pte dirtiness from
PG_dirty. With this, if pmd is clean(ie, MADV_FREEed) when split
happens(e,g, shrink_page_list), all of pages are clean too so we could
discard them.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During freeze_page(), we remove the page from rmap. It munlocks the
page if it was mlocked. clear_page_mlock() uses thelru cache, which
temporary pins the page.
Let's drain the lru cache before checking page's count vs. mapcount.
The change makes mlocked page split on first attempt, if it was not
pinned by somebody else.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Writing 1 into 'split_huge_pages' will try to find and split all huge
pages in the system. This is useful for debuging.
[akpm@linux-foundation.org: fix printk text, per Vlastimil]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with PMD, so there's no reason to look on PTEs
for PageTransHuge() pages. That's no true anymore: THP can be mapped
with PTEs too.
The patch removes PageTransHuge() test from the functions and opencode
page table check.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before THP refcounting rework, THP was not allowed to cross VMA
boundary. So, if we have THP and we split it, PG_mlocked can be safely
transferred to small pages.
With new THP refcounting and naive approach to mlocking we can end up
with this scenario:
1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
2. the process does munlock() on the *part* of the THP:
- the VMA is split into two, one of them VM_LOCKED;
- huge PMD split into PTE table;
- THP is still mlocked;
3. split_huge_page():
- it transfers PG_mlocked to *all* small pages regrardless if it
blong to any VM_LOCKED VMA.
We probably could munlock() all small pages on split_huge_page(), but I
think we have accounting issue already on step two.
Instead of forbidding mlocked pages altogether, we just avoid mlocking
PTE-mapped THPs and munlock THPs on split_huge_pmd().
This means PTE-mapped THPs will be on normal lru lists and will be split
under memory pressure by vmscan. After the split vmscan will detect
unevictable small pages and mlock them.
With this approach we shouldn't hit situation like described above.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we don't split huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.
Furtunately, we can detect partial unmap on page_remove_rmap(). But we
cannot call split_huge_page() from there due to locking context.
It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.
The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting. The splitting itself will happen when we get
memory pressure via shrinker interface. The page will be dropped from
list on freeing through compound page destructor.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds implementation of split_huge_page() for new
refcountings.
Unlike previous implementation, new split_huge_page() can fail if
somebody holds GUP pin on the page. It also means that pin on page
would prevent it from bening split under you. It makes situation in
many places much cleaner.
The basic scheme of split_huge_page():
- Check that sum of mapcounts of all subpage is equal to page_count()
plus one (caller pin). Foll off with -EBUSY. This way we can avoid
useless PMD-splits.
- Freeze the page counters by splitting all PMD and setup migration
PTEs.
- Re-check sum of mapcounts against page_count(). Page's counts are
stable now. -EBUSY if page is pinned.
- Split compound page.
- Unfreeze the page by removing migration entries.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to use migration PTE entries to stabilize page counts. If
the page is mapped with PMDs we need to split the PMD and setup
migration entries. It's reasonable to combine these operations to avoid
double-scanning over the page table.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>