After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
ptl doesn't make much sense in this case.
Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax-huge-page mapping while it uses some thp helpers is ultimately not
a transparent huge page. The distinction is especially important in the
get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge() which have slightly different
semantics.
Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.
[kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need to split THP page when MADV_FREE syscall is called if
[start, len] is aligned with THP size. The split could be done when VM
decide to free it in reclaim path if memory pressure is heavy. With
that, we could avoid unnecessary THP split.
For the feature, this patch changes pte dirtness marking logic of THP.
Now, it marks every ptes of pages dirty unconditionally in splitting,
which makes MADV_FREE void. So, instead, this patch propagates pmd
dirtiness to all pages via PG_dirty and restores pte dirtiness from
PG_dirty. With this, if pmd is clean(ie, MADV_FREEed) when split
happens(e,g, shrink_page_list), all of pages are clean too so we could
discard them.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with PMD, so there's no reason to look on PTEs
for PageTransHuge() pages. That's no true anymore: THP can be mapped
with PTEs too.
The patch removes PageTransHuge() test from the functions and opencode
page table check.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we don't split huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.
Furtunately, we can detect partial unmap on page_remove_rmap(). But we
cannot call split_huge_page() from there due to locking context.
It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.
The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting. The splitting itself will happen when we get
memory pressure via shrinker interface. The page will be dropped from
list on freeing through compound page destructor.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds implementation of split_huge_page() for new
refcountings.
Unlike previous implementation, new split_huge_page() can fail if
somebody holds GUP pin on the page. It also means that pin on page
would prevent it from bening split under you. It makes situation in
many places much cleaner.
The basic scheme of split_huge_page():
- Check that sum of mapcounts of all subpage is equal to page_count()
plus one (caller pin). Foll off with -EBUSY. This way we can avoid
useless PMD-splits.
- Freeze the page counters by splitting all PMD and setup migration
PTEs.
- Re-check sum of mapcounts against page_count(). Page's counts are
stable now. -EBUSY if page is pinned.
- Split compound page.
- Unfreeze the page by removing migration entries.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Original split_huge_page() combined two operations: splitting PMDs into
tables of PTEs and splitting underlying compound page. This patch
implements split_huge_pmd() which split given PMD without splitting
other PMDs this page mapped with or underlying compound page.
Without tail page refcounting, implementation of split_huge_pmd() is
pretty straight-forward.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We will re-introduce new version with new refcounting later in patchset.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to vm_insert_pfn(), but for PMDs rather than PTEs. The 'vmf_'
prefix instead of 'vm_' prefix is intended to indicate that it returns a
VMF_ value rather than an errno (which would only have to be converted
into a VMF_ value anyway).
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To use the huge zero page in DAX, we need these functions exported.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series of patches adds support for using PMD page table entries to
map DAX files. We expect NV-DIMMs to start showing up that are many
gigabytes in size and the memory consumption of 4kB PTEs will be
astronomical.
The patch series leverages much of the Transparant Huge Pages
infrastructure, going so far as to borrow one of Kirill's patches from
his THP page cache series.
This patch (of 10):
Since we're going to have huge pages in page cache, we need to call adjust
file-backed VMA, which potentially can contain huge pages.
For now we call it for all VMAs.
Probably later we will need to introduce a flag to indicate that the VMA
has huge pages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
detect zero_page in /proc/kpageflags, and then do memory analysis more
accurately.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
more information when they trigger.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 71e3aac072 ("thp: transparent hugepage core") adds
copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this
function have been used outside of memory.c, but it currently isn't.
This patch makes copy_pte_range() static again.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).
This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.
This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.
This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we don't clobber page_tail->first_page during split_huge_page,
so compound_trans_head can be set to compound_head without adverse
effects, and this mostly optimizes away a smp_rmb.
It looks worthwhile to keep around the implementation that doesn't relay
on page_tail->first_page not to be clobbered, because it would be
necessary if we'll decide to enforce page->private to zero at all times
whenever PG_private is not set, also for anonymous pages. For anonymous
pages enforcing such an invariant doesn't matter as anonymous pages
don't use page->private so we can get away with this microoptimization.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With split ptlock it's important to know which lock
pmd_trans_huge_lock() took. This patch adds one more parameter to the
function to return the lock.
In most places migration to new api is trivial. Exception is
move_huge_pmd(): we need to take two locks if pmd tables are different.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc changes for the 3.11 merge window. In addition to
the usual bug fixes and small updates, the main highlights are:
- Support for transparent huge pages by Aneesh Kumar for 64-bit
server processors. This allows the use of 16M pages as transparent
huge pages on kernels compiled with a 64K base page size.
- Base VFIO support for KVM on power by Alexey Kardashevskiy
- Wiring up of our nvram to the pstore infrastructure, including
putting compressed oopses in there by Aruna Balakrishnaiah
- Move, rework and improve our "EEH" (basically PCI error handling
and recovery) infrastructure. It is no longer specific to pseries
but is now usable by the new "powernv" platform as well (no
hypervisor) by Gavin Shan.
- I fixed some bugs in our math-emu instruction decoding and made it
usable to emulate some optional FP instructions on processors with
hard FP that lack them (such as fsqrt on Freescale embedded
processors).
- Support for Power8 "Event Based Branch" facility by Michael
Ellerman. This facility allows what is basically "userspace
interrupts" for performance monitor events.
- A bunch of Transactional Memory vs. Signals bug fixes and HW
breakpoint/watchpoint fixes by Michael Neuling.
And more ... I appologize in advance if I've failed to highlight
something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
pstore: Add hsize argument in write_buf call of pstore_ftrace_call
powerpc/fsl: add MPIC timer wakeup support
powerpc/mpic: create mpic subsystem object
powerpc/mpic: add global timer support
powerpc/mpic: add irq_set_wake support
powerpc/85xx: enable coreint for all the 64bit boards
powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
powerpc/mpic: Add get_version API both for internal and external use
powerpc: Handle both new style and old style reserve maps
powerpc/hw_brk: Fix off by one error when validating DAWR region end
powerpc/pseries: Support compression of oops text via pstore
powerpc/pseries: Re-organise the oops compression code
pstore: Pass header size in the pstore write callback
powerpc/powernv: Fix iommu initialization again
powerpc/pseries: Inform the hypervisor we are using EBB regs
powerpc/perf: Add power8 EBB support
powerpc/perf: Core EBB support for 64-bit book3s
powerpc/perf: Drop MMCRA from thread_struct
powerpc/perf: Don't enable if we have zero events
...
Currently, HPAGE_PMD_* constans rely on PMD_SHIFT regardless of
CONFIG_TRANSPARENT_HUGEPAGE. PMD_SHIFT is not defined everywhere (e.g.
arm nommu case).
It means we can't use anything like this in generic code:
if (PageTransHuge(page))
zero_huge_user(page, 0, HPAGE_PMD_SIZE);
else
clear_highpage(page);
For !THP case, PageTransHuge() is 0 and compiler can eliminate
zero_huge_user() call. But it still need to be valid C expression, means
HPAGE_PMD_SIZE has to expand to something compiler can understand.
Previously, HPAGE_PMD_* were defined to BUILD_BUG() for !THP. Let's come
back to it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For architectures like powerpc that support multiple explicit hugepage
sizes, HPAGE_SHIFT indicate the default explicit hugepage shift. For THP
to work the hugepage size should be same as PMD_SIZE. So use PMD_SHIFT
directly. So move the define outside CONFIG_TRANSPARENT_HUGEPAGE #ifdef
because we want to use these defines in generic code with if
(pmd_trans_huge()) conditional.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
All Transparent Huge Pages are allocated by the buddy allocator.
A compile time check is in place that fails when the order of a
transparent huge page is too large to be allocated by the buddy
allocator. Unfortunately that compile time check passes when:
HPAGE_PMD_ORDER == MAX_ORDER
( which is incorrect as the buddy allocator can only allocate
memory of order strictly less than MAX_ORDER. )
This patch updates the compile time check to fail in the above
case.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
In page reclaim, huge page is split. split_huge_page() adds tail pages
to LRU list. Since we are reclaiming a huge page, it's better we
reclaim all subpages of the huge page instead of just the head page.
This patch adds split tail pages to shrink page list so the tail pages
can be reclaimed soon.
Before this patch, run a swap workload:
thp_fault_alloc 3492
thp_fault_fallback 608
thp_collapse_alloc 6
thp_collapse_alloc_failed 0
thp_split 916
With this patch:
thp_fault_alloc 4085
thp_fault_fallback 16
thp_collapse_alloc 90
thp_collapse_alloc_failed 0
thp_split 1272
fallback allocation is reduced a lot.
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in commit 4fc3f1d66b ("mm/rmap, migration: Make
rmap_walk_anon() and try_to_unmap_anon() more scalable") says:
| Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
| to make it clearer that it's an exclusive write-lock in
| that case - suggested by Rik van Riel.
But that commit renames only anon_vma_lock()
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
Ka0JKgnWvsa6ez6FSzKI
=ivQa
-----END PGP SIGNATURE-----
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent set of comparisons available from different people are
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests shows that balancenuma is
incapable of converging for this workloads driven by perf which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and sometimes missed in treports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch
handles PTEs but I'm no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
By default kernel tries to use huge zero page on read page fault. It's
possible to disable huge zero page by writing 0 or enable it back by
writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass vma instead of mm and add address parameter.
In most cases we already have vma on the stack. We provides
split_huge_page_pmd_mm() for few cases when we have mm, but not vma.
This change is preparation to huge zero pmd splitting implementation.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On x86 memory accesses to pages without the ACCESSED flag set result in
the ACCESSED flag being set automatically. With the ARM architecture a
page access fault is raised instead (and it will continue to be raised
until the ACCESSED flag is set for the appropriate PTE/PMD).
For normal memory pages, handle_pte_fault will call pte_mkyoung
(effectively setting the ACCESSED flag). For transparent huge pages,
pmd_mkyoung will only be called for a write fault.
This patch ensures that faults on transparent hugepages which do not
result in a CoW update the access flags for the faulting pmd.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ni zhan Chen <nizhan.chen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rmap_walk_anon() and try_to_unmap_anon() appears to be too
careful about locking the anon vma: while it needs protection
against anon vma list modifications, it does not need exclusive
access to the list itself.
Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:
96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread
With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.
Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
This patch converts change_prot_numa() to use change_protection(). As
pte_numa and friends check the PTE bits directly it is necessary for
change_protection() to use pmd_mknuma(). Hence the required
modifications to change_protection() are a little clumsy but the
end result is that most of the numa page table helpers are just one or
two instructions.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
sufficiently different that the signed-off-bys were dropped
Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
pieces into an effective migrate on fault scheme.
Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
page-migration performance.
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Note: This patch started as "mm/mpol: Create special PROT_NONE
infrastructure" and preserves the basic idea but steals *very*
heavily from "autonuma: numa hinting page faults entry points" for
the actual fault handlers without the migration parts. The end
result is barely recognisable as either patch so all Signed-off
and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
this version, I will re-add the signed-offs-by to reflect the history.
In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant, we can then use the 'spurious'
protection faults to drive our migrations from.
The meaning of PAGE_NUMA depends on the architecture but on x86 it is
effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
NUMA faults for the reason that the page fault code checks the permission on
the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
before it ever calls handle_mm_fault.
[dhillf@gmail.com: Fix typo]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
When a transparent hugepage is mapped and it is included in an mlock()
range, follow_page() incorrectly avoids setting the page's mlock bit and
moving it to the unevictable lru.
This is evident if you try to mlock(), munlock(), and then mlock() a
range again. Currently:
#define MAP_SIZE (4 << 30) /* 4GB */
void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
mlock(ptr, MAP_SIZE);
$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
Inactive(anon): 6304 kB
Unevictable: 4213924 kB
munlock(ptr, MAP_SIZE);
Inactive(anon): 4186252 kB
Unevictable: 19652 kB
mlock(ptr, MAP_SIZE);
Inactive(anon): 4198556 kB
Unevictable: 21684 kB
Notice that less than 2MB was added to the unevictable list; this is
because these pages in the range are not transparent hugepages since the
4GB range was allocated with mmap() and has no specific alignment. If
posix_memalign() were used instead, unevictable would not have grown at
all on the second mlock().
The fix is to call mlock_vma_page() so that the mlock bit is set and the
page is added to the unevictable list. With this patch:
mlock(ptr, MAP_SIZE);
Inactive(anon): 4056 kB
Unevictable: 4213940 kB
munlock(ptr, MAP_SIZE);
Inactive(anon): 4198268 kB
Unevictable: 19636 kB
mlock(ptr, MAP_SIZE);
Inactive(anon): 4008 kB
Unevictable: 4213940 kB
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The thp page table pre-allocation code currently assumes that pgtable_t is
of type "struct page *". This may not be true for all architectures, so
this patch removes that assumption by replacing the functions
prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
can be defined architecture-specific.
It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be
no functional change introduced by this patch.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When transparent_hugepage_enabled() is used outside mm/, such as in
arch/x86/xx/tlb.c:
+ if (!cpu_has_invlpg || vma->vm_flags & VM_HUGETLB
+ || transparent_hugepage_enabled(vma)) {
+ flush_tlb_mm(vma->vm_mm);
is_vma_temporary_stack() isn't referenced in huge_mm.h, so it has compile
errors:
arch/x86/mm/tlb.c: In function `flush_tlb_range':
arch/x86/mm/tlb.c:324:4: error: implicit declaration of function `is_vma_temporary_stack' [-Werror=implicit-function-declaration]
Since is_vma_temporay_stack() is just used in rmap.c and huge_memory.c, it
is better to move it to huge_mm.h from rmap.h to avoid such errors.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These macros will be used in a later patch, where all usages are expected
to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
detect unexpected usages, we convert the existing BUG() to BUILD_BUG().
[akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently when we check if we can handle thp as it is or we need to split
it into regular sized pages, we hold page table lock prior to check
whether a given pmd is mapping thp or not. Because of this, when it's not
"huge pmd" we suffer from unnecessary lock/unlock overhead. To remove it,
this patch introduces a optimized check function and replace several
similar logics with it.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have tlb_remove_tlb_entry to indicate a pte tlb flush entry should be
flushed, but not a corresponding API for pmd entry. This isn't a
problem so far because THP is only for x86 currently and tlb_flush()
under x86 will flush entire TLB. But this is confusion and could be
missed if thp is ported to other arch.
Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
__tlb_remove_page() as suggested by Andrea Arcangeli. The
__tlb_remove_page() function is supposed to be called after
tlb_remove_xxx_tlb_entry() and we can catch any misuse.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (5UL*1024*1024*1024)
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
gettimeofday(&oldstamp, NULL);
p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
gettimeofday(&newstamp, NULL);
diffsec = newstamp.tv_sec - oldstamp.tv_sec;
diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
printf("usec %ld\n", diffsec);
if (p == MAP_FAILED || p4 != p3)
//if (p == MAP_FAILED)
perror("mremap"), exit(1);
if (memcmp(p4, p2, SIZE))
printf("mremap bug\n"), exit(1);
printf("ok\n");
return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only problem for native 2M mremap like it happens above both the
source and destination address must be 2M aligned or the hugepmd can't be
moved without a split but that is an hardware limitation.
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Straightforward conversion of anon_vma->lock to a mutex.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The huge_memory.c THP page fault was allowed to run if vm_ops was null
(which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
setup a special vma->vm_ops and it would fallback to regular anonymous
memory) but other THP logics weren't fully activated for vmas with vm_file
not NULL (/dev/zero has a not NULL vma->vm_file).
So this removes the vm_file checks so that /dev/zero also can safely use
THP (the other albeit safer approach to fix this bug would have been to
prevent the THP initial page fault to run if vm_file was set).
After removing the vm_file checks, this also makes huge_memory.c stricter
in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file
check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP
under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should
only be allowed to exist before the first page fault, and in turn when
vma->anon_vma is null (so preventing khugepaged registration). So I tend
to think the previous comment saying if vm_file was set, VM_PFNMAP might
have been set and we could still be registered in khugepaged (despite
anon_vma was not NULL to be registered in khugepaged) was too paranoid.
The is_linear_pfn_mapping check is also I think superfluous (as described
by comment) but under DEBUG_VM it is safe to stay.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Caspar Zhang <bugs@casparzhang.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org> [2.6.38.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transparent hugepages can only be created if rmap is fully
functional. So we must prevent hugepages to be created while
is_vma_temporary_stack() is true.
This also optmizes away some harmless but unnecessary setting of
khugepaged_scan.address and it switches some BUG_ON to VM_BUG_ON.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup some code with common compound_trans_head helper.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
mmap and before touching the memory. While this is enough for most
usages, it's little effort to make madvise more dynamic at runtime on an
existing mapping by making khugepaged aware about madvise.
MADV_HUGEPAGE: register in khugepaged immediately without waiting a page
fault (that may not ever happen if all pages are already mapped and the
"enabled" knob was set to madvise during the initial page faults).
MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
collapsing pages where not needed.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>