2006-03-22 16:09:12 +08:00
|
|
|
/*
|
2015-11-06 10:49:43 +08:00
|
|
|
* Memory Migration functionality - linux/mm/migrate.c
|
2006-03-22 16:09:12 +08:00
|
|
|
 *
 * Copyright (C) 2006 Silicon Graphics, Inc., Christoph Lameter
 *
 * Page migration was first developed in the context of the memory hotplug
 * project. The main authors of the migration code are:
 *
 * IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
 * Hirokazu Takahashi <taka@valinux.co.jp>
 * Dave Hansen <haveblue@us.ibm.com>
|
2008-07-05 00:59:22 +08:00
|
|
|
* Christoph Lameter
|
2006-03-22 16:09:12 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/migrate.h>
|
2011-10-16 14:01:52 +08:00
|
|
|
#include <linux/export.h>
|
2006-03-22 16:09:12 +08:00
|
|
|
#include <linux/swap.h>
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to a page.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 17:03:35 +08:00
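The make_migration_entry mix-up described in the commit message above can be reproduced in a small user-space sketch. The helper names below are hypothetical stand-ins, not the kernel's swapops.h code. The value 30 for SWP_MIGRATION_READ comes from the commit message; 31 for SWP_MIGRATION_WRITE is an assumption based on the "upper two swapfiles" wording.

```c
#include <stdbool.h>

/* READ = 30 is quoted in the commit message; WRITE = 31 is an assumption. */
enum {
	SWP_MIGRATION_READ  = 30,
	SWP_MIGRATION_WRITE = 31,
};

/*
 * Hypothetical stand-in for make_migration_entry(): the second argument of
 * the real helper is a boolean "write" flag, not a swap type.
 */
static int migration_type_for(bool write)
{
	return write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ;
}

/* Buggy call pattern: the nonzero type constant is taken as write == true. */
static int write_protect_buggy(void)
{
	return migration_type_for(SWP_MIGRATION_READ);
}

/* Intended call pattern: explicitly ask for a read-only entry. */
static int write_protect_fixed(void)
{
	return migration_type_for(false);
}
```

Because any nonzero value converts to true, the buggy variant hands back a write entry, which matches the symptom of COW mappings not being write-protected on fork.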
|
|
|
#include <linux/swapops.h>
|
2006-03-22 16:09:12 +08:00
|
|
|
#include <linux/pagemap.h>
|
2006-04-11 13:52:57 +08:00
|
|
|
#include <linux/buffer_head.h>
|
2006-03-22 16:09:12 +08:00
|
|
|
#include <linux/mm_inline.h>
|
2007-10-19 14:40:14 +08:00
|
|
|
#include <linux/nsproxy.h>
|
2006-03-22 16:09:12 +08:00
|
|
|
#include <linux/pagevec.h>
|
ksm: rmap_walk to remove_migration_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:31 +08:00
|
|
|
#include <linux/ksm.h>
|
2006-03-22 16:09:12 +08:00
|
|
|
#include <linux/rmap.h>
|
|
|
|
#include <linux/topology.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/cpuset.h>
|
2006-06-23 17:03:38 +08:00
|
|
|
#include <linux/writeback.h>
|
2006-06-23 17:03:55 +08:00
|
|
|
#include <linux/mempolicy.h>
|
|
|
|
#include <linux/vmalloc.h>
|
2006-06-23 17:04:02 +08:00
|
|
|
#include <linux/security.h>
|
mm: migrate dirty page without clear_page_dirty_for_io etc
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.
It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).
Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).
But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:50:05 +08:00
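The idea in the commit message above, moving the dirty flag and its per-zone counter directly and skipping the counter update when both pages sit in the same zone, can be sketched in user space as follows. The struct and function names are hypothetical, and the sketch leaves out the locking (tree_lock, disabled interrupts) that the real code relies on.

```c
#include <stdbool.h>

/* Hypothetical per-zone counter, standing in for the NR_FILE_DIRTY stat. */
struct zone_sketch {
	long nr_file_dirty;
};

/* Hypothetical page: just the zone it belongs to and a dirty flag. */
struct page_sketch {
	struct zone_sketch *zone;
	bool dirty;
};

/*
 * Move the dirty flag from oldpage to newpage, adjusting the per-zone
 * counters only when the two pages live in different zones.
 */
static void move_dirty(struct page_sketch *oldpage, struct page_sketch *newpage)
{
	if (!oldpage->dirty)
		return;

	oldpage->dirty = false;
	newpage->dirty = true;

	if (oldpage->zone != newpage->zone) {
		oldpage->zone->nr_file_dirty--;
		newpage->zone->nr_file_dirty++;
	}
}
```

Compared with clearing and re-setting the dirty state through the generic helpers, this touches two counters at most and none in the same-zone case.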
|
|
|
#include <linux/backing-dev.h>
|
2008-07-24 12:27:02 +08:00
|
|
|
#include <linux/syscalls.h>
|
2010-09-08 09:19:35 +08:00
|
|
|
#include <linux/hugetlb.h>
|
2012-08-01 07:42:27 +08:00
|
|
|
#include <linux/hugetlb_cgroup.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/gfp.h>
|
2012-12-12 08:02:42 +08:00
|
|
|
#include <linux/balloon_compaction.h>
|
2013-12-19 09:08:33 +08:00
|
|
|
#include <linux/mmu_notifier.h>
|
mm: introduce idle page tracking
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting a bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 06:35:45 +08:00
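To locate a page in /sys/kernel/mm/page_idle/bitmap, a user-space tool needs the file offset and the bit for a given page frame number. The sketch below assumes the layout described in the kernel's idle page tracking documentation (an array of 8-byte words in native byte order, with PFN i at bit i % 64 of word i / 64); that layout is not spelled out in the commit message above, and the helper names are hypothetical.

```c
#include <stdint.h>

/* One 8-byte bitmap word covers 64 page frames (assumed layout). */
#define IDLE_BITS_PER_WORD 64u

/* Byte offset of the 8-byte word that holds the bit for this PFN. */
static uint64_t idle_word_offset(uint64_t pfn)
{
	return (pfn / IDLE_BITS_PER_WORD) * sizeof(uint64_t);
}

/* Mask selecting this PFN's bit within its word. */
static uint64_t idle_bit_mask(uint64_t pfn)
{
	return UINT64_C(1) << (pfn % IDLE_BITS_PER_WORD);
}
```

A tool would then read or write 8 bytes at idle_word_offset(pfn) and test or set idle_bit_mask(pfn), using PFNs obtained from /proc/PID/pagemap as the commit message suggests.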
|
|
|
#include <linux/page_idle.h>
|
2016-03-16 05:56:15 +08:00
|
|
|
#include <linux/page_owner.h>
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2010-12-22 09:24:26 +08:00
|
|
|
#include <asm/tlbflush.h>
|
|
|
|
|
2012-10-19 21:07:31 +08:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/migrate.h>
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
#include "internal.h"
|
|
|
|
|
|
|
|
/*
|
2006-06-23 17:03:55 +08:00
|
|
|
* migrate_prep() needs to be called before we start compiling a list of pages
|
2010-05-25 05:32:27 +08:00
|
|
|
* to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
|
|
|
|
* undesirable, use migrate_prep_local()
|
2006-03-22 16:09:12 +08:00
|
|
|
*/
|
|
|
|
int migrate_prep(void)
{
	/*
	 * Clear the LRU lists so pages can be isolated.
	 * Note that pages may be moved off the LRU after we have
	 * drained them. Those pages will fail to migrate like other
	 * pages that may be busy.
	 */
	lru_add_drain_all();

	return 0;
}
|
|
|
|
|
2010-05-25 05:32:27 +08:00
|
|
|
/* Do the necessary work of migrate_prep but not if it involves other CPUs */
int migrate_prep_local(void)
{
	lru_add_drain();

	return 0;
}
|
|
|
|
|
2012-12-12 08:02:47 +08:00
|
|
|
/*
 * Put previously isolated pages back onto the appropriate lists
 * from where they were once taken off for compaction/migration.
 *
|
2014-01-22 07:51:17 +08:00
|
|
|
* This function shall be used whenever the isolated pageset has been
|
|
|
|
* built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
|
|
|
|
* and isolate_huge_page().
|
2012-12-12 08:02:47 +08:00
|
|
|
*/
|
|
|
|
void putback_movable_pages(struct list_head *l)
{
	struct page *page;
	struct page *page2;
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
list_for_each_entry_safe(page, page2, l, lru) {
|
mm: migrate: make core migration code aware of hugepage
Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.
The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.
This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)
Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)
As for how these are achieved, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.
My testing is mainly functional, simply kicking hugepage migration via
each entry point and confirming that migration is done correctly. Test code
is available here:
git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.
This patch (of 9):
Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both LRU pages and hugepages.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 05:21:59 +08:00
|
|
|
		if (unlikely(PageHuge(page))) {
			putback_active_hugepage(page);
			continue;
		}
|
2006-06-23 17:03:51 +08:00
|
|
|
list_del(&page->lru);
|
2009-09-22 08:01:37 +08:00
|
|
|
dec_zone_page_state(page, NR_ISOLATED_ANON +
|
2009-09-22 08:02:59 +08:00
|
|
|
page_is_file_cache(page));
|
2013-10-01 04:45:16 +08:00
|
|
|
if (unlikely(isolated_balloon_page(page)))
|
2012-12-12 08:02:42 +08:00
|
|
|
			balloon_page_putback(page);
		else
			putback_lru_page(page);
|
2006-03-22 16:09:12 +08:00
|
|
|
}
|
|
|
|
}
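putback_movable_pages() unlinks entries while walking the list, which is why it uses list_for_each_entry_safe() with a second cursor. The user-space sketch below illustrates that pattern with a minimal circular list; the node type and helper names are hypothetical and only loosely modelled on the kernel's list_head.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, loosely modelled on list_head. */
struct node {
	struct node *prev, *next;
	int isolated;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail_node(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

/*
 * Unlink every entry and clear its "isolated" mark. The successor is saved
 * in "next" before the current node is unlinked, so the walk never follows
 * a pointer that list_del_node() has already cleared.
 */
static int putback_all(struct node *head)
{
	struct node *cur, *next;
	int count = 0;

	for (cur = head->next, next = cur->next; cur != head;
	     cur = next, next = cur->next) {
		list_del_node(cur);
		cur->isolated = 0;
		count++;
	}
	return count;
}
```

A plain iterator that read cur->next after the unlink would dereference the cleared pointer here; saving the successor first avoids that.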
|
|
|
|
|
2006-06-23 17:03:35 +08:00
|
|
|
/*
|
|
|
|
* Restore a potential migration pte to a working pte entry
|
|
|
|
*/
|
2009-12-15 09:59:31 +08:00
|
|
|
static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, void *old)
|
2006-06-23 17:03:35 +08:00
|
|
|
{
	struct mm_struct *mm = vma->vm_mm;
	swp_entry_t entry;
	pmd_t *pmd;
	pte_t *ptep, pte;
	spinlock_t *ptl;
|
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
	if (unlikely(PageHuge(new))) {
		ptep = huge_pte_offset(mm, addr);
		if (!ptep)
			goto out;
|
2013-11-15 06:31:02 +08:00
|
|
|
ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
|
2010-09-08 09:19:35 +08:00
|
|
|
} else {
|
2012-12-12 08:00:37 +08:00
|
|
|
pmd = mm_find_pmd(mm, addr);
|
|
|
|
if (!pmd)
|
2010-09-08 09:19:35 +08:00
|
|
|
goto out;
|
2006-06-23 17:03:35 +08:00
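The boolean-argument slip described in the commit message above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: every `model_*` / `MODEL_*` name is a hypothetical stand-in rather than the kernel's definition, and the value 31 for the write type is assumed (the message itself only states that SWP_MIGRATION_READ is 30).

```c
#include <stdbool.h>

/* Hypothetical model values: READ = 30 is from the commit message,
 * WRITE = 31 is an assumption made for this illustration. */
enum {
	MODEL_SWP_MIGRATION_READ  = 30,
	MODEL_SWP_MIGRATION_WRITE = 31,
};

/* Simplified stand-in for a swap entry: a type plus a page frame number. */
struct model_entry {
	unsigned int type;
	unsigned long pfn;
};

/* Second argument is a boolean "write" flag, not an entry type. */
static struct model_entry model_make_migration_entry(unsigned long pfn, int write)
{
	struct model_entry e = {
		write ? MODEL_SWP_MIGRATION_WRITE : MODEL_SWP_MIGRATION_READ,
		pfn,
	};
	return e;
}

static bool model_is_write_migration_entry(struct model_entry e)
{
	return e.type == MODEL_SWP_MIGRATION_WRITE;
}

/* Downgrade an existing entry to read-only, leaving the pfn untouched. */
static void model_make_migration_entry_read(struct model_entry *e)
{
	e->type = MODEL_SWP_MIGRATION_READ;
}
```

In this model, passing `MODEL_SWP_MIGRATION_READ` (30, which is nonzero) as the `write` flag produces a write entry, the opposite of what the caller intended. Testing with `model_is_write_migration_entry()` and then calling `model_make_migration_entry_read()` avoids mixing type constants with flags.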
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
ptep = pte_offset_map(pmd, addr);
|
2006-06-23 17:03:35 +08:00
|
|
|
|
mm: fix race between mremap and removing migration entry
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in, and enough
while migration always held mmap_sem; but not enough nowadays, when
there's memory hotremove and compaction.
The danger is that move_ptes() lets a migration entry dodge around
behind remove_migration_pte()'s back, so it's in the old location when
looking at the new, then in the new location when looking at the old.
Either mremap's move_ptes() must additionally take anon_vma lock(), or
migration's remove_migration_pte() must stop peeking for is_swap_entry()
before it takes pagetable lock.
Consensus chooses the latter: we prefer to add overhead to migration
than to mremapping, which gets used by JVMs and by exec stack setup.
Reported-and-tested-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-20 03:50:35 +08:00
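The choice made in the commit above (do not peek at the entry before taking the page table lock) follows a general pattern: inspect and update a slot only while holding its lock. A minimal userspace sketch of that pattern follows, assuming a C11 `atomic_flag` spinlock as a stand-in for the page table lock; all `model_*` names are hypothetical and none of this is kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one page table slot guarded by its own lock. */
struct model_slot {
	atomic_flag lock;   /* stands in for the page table lock */
	unsigned long val;  /* stands in for the pte value */
};

static void model_lock(struct model_slot *s)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;
}

static void model_unlock(struct model_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/*
 * Replace `expected` with `restored` if the slot currently holds `expected`.
 * The slot is read only after the lock is taken, so a concurrent mover
 * cannot change it between the check and the update.
 */
static bool model_restore_if_match(struct model_slot *s,
				   unsigned long expected,
				   unsigned long restored)
{
	bool hit = false;

	model_lock(s);
	if (s->val == expected) {
		s->val = restored;
		hit = true;
	}
	model_unlock(s);
	return hit;
}
```

The cost is that the lock is taken even when the slot turns out not to match, which mirrors the trade-off the commit message accepts: extra overhead on the migration side rather than on mremap.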
|
|
|
/*
|
|
|
|
* Peek to check is_swap_pte() before taking ptlock? No, we
|
|
|
|
* can race mremap's move_ptes(), which skips anon_vma lock.
|
|
|
|
*/
|
2010-09-08 09:19:35 +08:00
|
|
|
|
|
|
|
ptl = pte_lockptr(mm, pmd);
|
|
|
|
}
|
2006-06-23 17:03:35 +08:00
|
|
|
|
|
|
|
spin_lock(ptl);
|
|
|
|
pte = *ptep;
|
|
|
|
if (!is_swap_pte(pte))
|
ksm: rmap_walk to remove_migation_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:31 +08:00
|
|
|
goto unlock;
|
2006-06-23 17:03:35 +08:00
|
|
|
|
|
|
|
entry = pte_to_swp_entry(pte);
|
|
|
|
|
2009-12-15 09:59:31 +08:00
|
|
|
if (!is_migration_entry(entry) ||
|
|
|
|
migration_entry_to_page(entry) != old)
|
|
|
|
goto unlock;
|
2006-06-23 17:03:35 +08:00
|
|
|
|
|
|
|
get_page(new);
|
|
|
|
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
|
2013-10-17 04:46:51 +08:00
|
|
|
if (pte_swp_soft_dirty(*ptep))
|
|
|
|
pte = pte_mksoft_dirty(pte);
|
2014-10-03 02:47:41 +08:00
|
|
|
|
|
|
|
/* Recheck VMA as permissions can change since migration started */
|
2006-06-23 17:03:35 +08:00
|
|
|
if (is_write_migration_entry(entry))
|
2014-10-03 02:47:41 +08:00
|
|
|
pte = maybe_mkwrite(pte, vma);
|
|
|
|
|
2010-10-11 22:03:21 +08:00
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
2013-02-05 06:28:46 +08:00
|
|
|
if (PageHuge(new)) {
|
2010-09-08 09:19:35 +08:00
|
|
|
pte = pte_mkhuge(pte);
|
2013-02-05 06:28:46 +08:00
|
|
|
pte = arch_make_huge_pte(pte, vma, new, 0);
|
|
|
|
}
|
2010-10-11 22:03:21 +08:00
|
|
|
#endif
|
2013-05-25 06:55:18 +08:00
|
|
|
flush_dcache_page(new);
|
2006-06-23 17:03:35 +08:00
|
|
|
set_pte_at(mm, addr, ptep, pte);
|
2006-06-23 17:03:38 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
if (PageHuge(new)) {
|
|
|
|
if (PageAnon(new))
|
|
|
|
hugepage_add_anon_rmap(new, vma, addr);
|
|
|
|
else
|
2016-01-16 08:53:42 +08:00
|
|
|
page_dup_rmap(new, true);
|
2010-09-08 09:19:35 +08:00
|
|
|
} else if (PageAnon(new))
|
2016-01-16 08:52:16 +08:00
|
|
|
page_add_anon_rmap(new, vma, addr, false);
|
2006-06-23 17:03:38 +08:00
|
|
|
else
|
|
|
|
page_add_file_rmap(new);
|
|
|
|
|
2016-03-18 05:20:07 +08:00
|
|
|
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
|
2015-11-06 10:49:37 +08:00
|
|
|
mlock_vma_page(new);
|
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
/* No need to invalidate - it was non-present before */
|
MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
On VIVT ARM, when we have multiple shared mappings of the same file
in the same MM, we need to ensure that we have coherency across all
copies. We do this via make_coherent() by making the pages
uncacheable.
This used to work fine, until we allowed highmem with highpte - we
now have a page table which is mapped as required, and is not available
for modification via update_mmu_cache().
Ralf Baechle suggested getting rid of the PTE value passed to
update_mmu_cache():
On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
to construct a pointer to the pte again. Passing a pte_t * is much
more elegant. Maybe we might even replace the pte argument with the
pte_t?
Ben Herrenschmidt would also like the pte pointer for PowerPC:
Passing the ptep in there is exactly what I want. I want that
-instead- of the PTE value, because I have issue on some ppc cases,
for I$/D$ coherency, where set_pte_at() may decide to mask out the
_PAGE_EXEC.
So, pass in the mapped page table pointer into update_mmu_cache(), and
remove the PTE value, updating all implementations and call sites to
suit.
Includes a fix from Stephen Rothwell:
sparc: fix fallout from update_mmu_cache API change
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2009-12-19 00:40:18 +08:00
|
|
|
update_mmu_cache(vma, addr, ptep);
|
2009-12-15 09:59:31 +08:00
unlock:
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to a page.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 17:03:35 +08:00
	pte_unmap_unlock(ptep, ptl);
2009-12-15 09:59:31 +08:00
out:
	return SWAP_AGAIN;
2006-06-23 17:03:35 +08:00
}

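The commit notes quoted above describe migration entries as swap entries that use the two highest swap types, and they recount a bug where a type constant (30) was passed as the boolean "write" argument of make_migration_entry. The following is a user-space sketch of that idea, not the kernel's implementation: every name prefixed with `sketch_` or `SKETCH_` is hypothetical, and the 64-bit layout with 5 type bits is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 5 type bits on top, the offset (here a pfn) below. */
#define SKETCH_TYPE_BITS    5
#define SKETCH_OFFSET_BITS  (64 - SKETCH_TYPE_BITS)
#define SKETCH_OFFSET_MASK  ((UINT64_C(1) << SKETCH_OFFSET_BITS) - 1)

/* The two highest types are reserved for migration entries. */
#define SKETCH_MIG_READ     30u
#define SKETCH_MIG_WRITE    31u

typedef struct { uint64_t val; } sketch_entry_t;

static inline sketch_entry_t sketch_entry(unsigned type, uint64_t offset)
{
	sketch_entry_t e = {
		((uint64_t)type << SKETCH_OFFSET_BITS) |
		(offset & SKETCH_OFFSET_MASK)
	};
	return e;
}

static inline unsigned sketch_type(sketch_entry_t e)
{
	return (unsigned)(e.val >> SKETCH_OFFSET_BITS);
}

static inline uint64_t sketch_offset(sketch_entry_t e)
{
	return e.val & SKETCH_OFFSET_MASK;
}

static inline bool sketch_is_migration(sketch_entry_t e)
{
	return sketch_type(e) == SKETCH_MIG_READ ||
	       sketch_type(e) == SKETCH_MIG_WRITE;
}

static inline bool sketch_is_write_migration(sketch_entry_t e)
{
	return sketch_type(e) == SKETCH_MIG_WRITE;
}

/* The second argument is a boolean, not a swap type. */
static inline sketch_entry_t sketch_make_migration(uint64_t pfn, bool write)
{
	return sketch_entry(write ? SKETCH_MIG_WRITE : SKETCH_MIG_READ, pfn);
}

/* Downgrade an entry to read-only while keeping its offset. */
static inline sketch_entry_t sketch_make_read(sketch_entry_t e)
{
	return sketch_entry(SKETCH_MIG_READ, sketch_offset(e));
}
```

In this sketch, calling `sketch_make_migration(pfn, SKETCH_MIG_READ)` yields a write entry, because 30 converts to `true`. Downgrading through a dedicated helper such as `sketch_make_read()` avoids that class of mistake.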
2006-06-23 17:03:38 +08:00
/*
 * Get rid of all migration entries and replace them by
 * references to the indicated page.
 */
2016-03-18 05:20:07 +08:00
void remove_migration_ptes(struct page *old, struct page *new, bool locked)
2006-06-23 17:03:38 +08:00
{
2014-01-22 07:49:48 +08:00
	struct rmap_walk_control rwc = {
		.rmap_one = remove_migration_pte,
		.arg = old,
	};

2016-03-18 05:20:07 +08:00
	if (locked)
		rmap_walk_locked(new, &rwc);
	else
		rmap_walk(new, &rwc);
2006-06-23 17:03:38 +08:00
}

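remove_migration_ptes() above hands a callback and an opaque argument to a generic walker through struct rmap_walk_control. A minimal user-space sketch of that control-struct pattern might look as follows. All `sketch_` names are hypothetical, the array stands in for the set of mappings the real walker discovers, and treating the "again" code as "keep walking" is an assumption about how the callback's SWAP_AGAIN return is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes: SKETCH_AGAIN means "keep walking". */
enum { SKETCH_AGAIN = 0, SKETCH_STOP = 1 };

/* Stand-in for one place where a page is, or used to be, mapped. */
struct sketch_mapping {
	bool is_migration_entry;
	int page_id;
};

/* Callback plus opaque argument, bundled the way rwc is above. */
struct sketch_walk_control {
	int (*visit_one)(struct sketch_mapping *m, void *arg);
	void *arg;
};

/* Visit every mapping until a callback asks to stop. */
static inline int sketch_walk(struct sketch_mapping *maps, size_t n,
			      const struct sketch_walk_control *wc)
{
	int ret = SKETCH_AGAIN;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = wc->visit_one(&maps[i], wc->arg);
		if (ret != SKETCH_AGAIN)
			break;
	}
	return ret;
}

struct sketch_restore_args {
	int old_id;
	int new_id;
};

/* Replace a placeholder for the old page with a reference to the new one. */
static inline int sketch_restore_one(struct sketch_mapping *m, void *arg)
{
	const struct sketch_restore_args *ra = arg;

	if (m->is_migration_entry && m->page_id == ra->old_id) {
		m->is_migration_entry = false;
		m->page_id = ra->new_id;
	}
	return SKETCH_AGAIN;
}
```

Keeping the walker ignorant of what the callback does is what lets one traversal serve several callers; only the control struct changes.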
2006-06-23 17:03:35 +08:00
/*
 * Something used the pte of a page under migration. We need to
 * get to the page and wait until migration is finished.
 * When we return from this function the fault will be retried.
 */
mm/hugetlb: take page table lock in follow_huge_pmd()
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.
This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.
Here is the reproducer:
  $ cat movepages.c
  #include <err.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <numa.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err(1, "move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err(1, "move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 07:25:22 +08:00
void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
2013-06-13 05:05:04 +08:00
				spinlock_t *ptl)
2006-06-23 17:03:35 +08:00
{
2013-06-13 05:05:04 +08:00
	pte_t pte;
2006-06-23 17:03:35 +08:00
	swp_entry_t entry;
	struct page *page;

2013-06-13 05:05:04 +08:00
	spin_lock(ptl);
2006-06-23 17:03:35 +08:00
	pte = *ptep;
	if (!is_swap_pte(pte))
		goto out;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		goto out;

	page = migration_entry_to_page(entry);

2008-07-26 10:45:30 +08:00
	/*
	 * Once radix-tree replacement of page migration started, page_count
	 * *must* be zero. And, we don't want to call wait_on_page_locked()
	 * against a page without get_page().
	 * So, we use get_page_unless_zero(), here. Even failed, page fault
	 * will occur again.
	 */
	if (!get_page_unless_zero(page))
		goto out;
2006-06-23 17:03:35 +08:00
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);
	put_page(page);
	return;
out:
	pte_unmap_unlock(ptep, ptl);
}

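__migration_entry_wait() above only waits on the page if it can first take a reference through get_page_unless_zero(), so that it never pins a page whose count has already dropped to zero. The "increment unless zero" step can be sketched in user space with C11 atomics. This is an illustration of the technique, not the kernel's code: `struct sketch_page`, `sketch_get_unless_zero()` and `sketch_put()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted page. */
struct sketch_page {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns false, leaving the count untouched, when it is zero.
 */
static inline bool sketch_get_unless_zero(struct sketch_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken earlier. */
static inline void sketch_put(struct sketch_page *p)
{
	atomic_fetch_sub(&p->refcount, 1);
}
```

A caller that gets `false` simply backs off, which mirrors the `goto out` path above: the fault is retried later and finds either the finished migration or a page that can be pinned.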
2013-06-13 05:05:04 +08:00
void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
				unsigned long address)
{
	spinlock_t *ptl = pte_lockptr(mm, pmd);
	pte_t *ptep = pte_offset_map(pmd, address);
	__migration_entry_wait(mm, ptep, ptl);
}

2013-11-15 06:31:02 +08:00
void migration_entry_wait_huge(struct vm_area_struct *vma,
		struct mm_struct *mm, pte_t *pte)
2013-06-13 05:05:04 +08:00
{
2013-11-15 06:31:02 +08:00
	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
2013-06-13 05:05:04 +08:00
	__migration_entry_wait(mm, pte, ptl);
}

2012-01-13 09:19:34 +08:00
|
|
|
#ifdef CONFIG_BLOCK
|
|
|
|
/* Returns true if all buffers are successfully locked */
|
2012-01-13 09:19:43 +08:00
|
|
|
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
|
|
|
|
enum migrate_mode mode)
|
2012-01-13 09:19:34 +08:00
|
|
|
{
|
|
|
|
struct buffer_head *bh = head;
|
|
|
|
|
|
|
|
/* Simple case, sync compaction */
|
2012-01-13 09:19:43 +08:00
|
|
|
if (mode != MIGRATE_ASYNC) {
|
2012-01-13 09:19:34 +08:00
|
|
|
do {
|
|
|
|
get_bh(bh);
|
|
|
|
lock_buffer(bh);
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
|
|
|
|
} while (bh != head);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* async case, we cannot block on lock_buffer so use trylock_buffer */
|
|
|
|
do {
|
|
|
|
get_bh(bh);
|
|
|
|
if (!trylock_buffer(bh)) {
|
|
|
|
/*
|
|
|
|
* We failed to lock the buffer and cannot stall in
|
|
|
|
* async migration. Release the locks taken so far.
|
|
|
|
*/
|
|
|
|
struct buffer_head *failed_bh = bh;
|
|
|
|
put_bh(failed_bh);
|
|
|
|
bh = head;
|
|
|
|
while (bh != failed_bh) {
|
|
|
|
unlock_buffer(bh);
|
|
|
|
put_bh(bh);
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
} while (bh != head);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
|
2012-01-13 09:19:43 +08:00
|
|
|
enum migrate_mode mode)
|
2012-01-13 09:19:34 +08:00
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_BLOCK */
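
The async path of buffer_migrate_lock_buffers() is an all-or-nothing trylock over a circular list. A minimal userspace sketch of that pattern follows; struct sketch_buf and the sketch_ helpers are hypothetical stand-ins for struct buffer_head, trylock_buffer(), unlock_buffer(), get_bh() and put_bh(), built on C11 atomic_flag rather than the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_buf {
	atomic_flag locked;		/* set while the buffer is locked */
	int refs;			/* stand-in for the buffer refcount */
	struct sketch_buf *next;	/* circular list, like b_this_page */
};

static void sketch_buf_init(struct sketch_buf *b, struct sketch_buf *next)
{
	atomic_flag_clear(&b->locked);
	b->refs = 0;
	b->next = next;
}

static bool sketch_trylock(struct sketch_buf *b)
{
	return !atomic_flag_test_and_set(&b->locked);
}

static void sketch_unlock(struct sketch_buf *b)
{
	atomic_flag_clear(&b->locked);
}

/* Lock every buffer on the ring without blocking; undo everything on failure. */
static bool sketch_trylock_ring(struct sketch_buf *head)
{
	struct sketch_buf *b = head;

	assert(head);
	do {
		b->refs++;
		if (!sketch_trylock(b)) {
			struct sketch_buf *failed = b;

			/* Drop the reference on the buffer we could not lock. */
			failed->refs--;
			/* Unlock and release everything locked before it. */
			for (b = head; b != failed; b = b->next) {
				sketch_unlock(b);
				b->refs--;
			}
			return false;
		}
		b = b->next;
	} while (b != head);
	return true;
}
```

The rollback walks from the head up to, but not including, the buffer that failed, so the caller sees either every buffer locked or none.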
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
2006-06-23 17:03:32 +08:00
|
|
|
* Replace the page in the mapping.
|
2006-06-23 17:03:29 +08:00
|
|
|
*
|
|
|
|
* The number of remaining references must be:
|
|
|
|
* 1 for anonymous pages without a mapping
|
|
|
|
* 2 for pages with a mapping
|
2009-04-03 23:42:36 +08:00
|
|
|
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
|
2006-03-22 16:09:12 +08:00
|
|
|
*/
|
2013-07-16 17:56:16 +08:00
|
|
|
int migrate_page_move_mapping(struct address_space *mapping,
|
2012-01-13 09:19:34 +08:00
|
|
|
struct page *newpage, struct page *page,
|
2013-12-22 06:56:08 +08:00
|
|
|
struct buffer_head *head, enum migrate_mode mode,
|
|
|
|
int extra_count)
|
2006-03-22 16:09:12 +08:00
|
|
|
{
|
mm: migrate dirty page without clear_page_dirty_for_io etc
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.
It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).
Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).
But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().
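
The movement described above reduces to transferring one flag and, only when the zones differ, one counter. A minimal userspace sketch, with hypothetical sketch_ types standing in for struct page and struct zone; locking, interrupt state and the mapping_cap_account_dirty() test are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_zone {
	long nr_file_dirty;	/* stand-in for the NR_FILE_DIRTY counter */
};

struct sketch_page {
	bool dirty;
	struct sketch_zone *zone;
};

/*
 * Move the dirty flag from page to newpage. The per-zone counter is
 * touched only when the two pages live in different zones.
 * Returns whether the old page was dirty.
 */
static bool sketch_move_dirty(struct sketch_page *newpage,
			      struct sketch_page *page)
{
	bool dirty;

	assert(newpage && page);
	dirty = page->dirty;
	if (dirty) {
		page->dirty = false;
		newpage->dirty = true;
		if (newpage->zone != page->zone) {
			page->zone->nr_file_dirty--;
			newpage->zone->nr_file_dirty++;
		}
	}
	return dirty;
}
```

No event counters are bumped and no tree is descended; the totals across zones stay constant, which is the point the message makes.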
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 10:50:05 +08:00
|
|
|
struct zone *oldzone, *newzone;
|
|
|
|
int dirty;
|
2013-12-22 06:56:08 +08:00
|
|
|
int expected_count = 1 + extra_count;
|
2006-12-07 12:33:44 +08:00
|
|
|
void **pslot;
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2006-06-23 17:03:37 +08:00
|
|
|
if (!mapping) {
|
2007-04-24 05:41:09 +08:00
|
|
|
/* Anonymous page without mapping */
|
2013-12-22 06:56:08 +08:00
|
|
|
if (page_count(page) != expected_count)
|
2006-06-23 17:03:37 +08:00
|
|
|
return -EAGAIN;
|
2015-11-06 10:50:02 +08:00
|
|
|
|
|
|
|
/* No turning back from here */
|
|
|
|
newpage->index = page->index;
|
|
|
|
newpage->mapping = page->mapping;
|
|
|
|
if (PageSwapBacked(page))
|
|
|
|
SetPageSwapBacked(newpage);
|
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
return MIGRATEPAGE_SUCCESS;
|
2006-06-23 17:03:37 +08:00
|
|
|
}
|
|
|
|
|
2015-11-06 10:50:05 +08:00
|
|
|
oldzone = page_zone(page);
|
|
|
|
newzone = page_zone(newpage);
|
|
|
|
|
2008-07-26 10:45:32 +08:00
|
|
|
spin_lock_irq(&mapping->tree_lock);
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2006-12-07 12:33:44 +08:00
|
|
|
pslot = radix_tree_lookup_slot(&mapping->page_tree,
|
|
|
|
page_index(page));
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2013-12-22 06:56:08 +08:00
|
|
|
expected_count += 1 + page_has_private(page);
|
2008-07-26 10:45:30 +08:00
|
|
|
if (page_count(page) != expected_count ||
|
2011-01-14 07:47:21 +08:00
|
|
|
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
|
2008-07-26 10:45:32 +08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
2006-04-11 13:52:57 +08:00
|
|
|
return -EAGAIN;
|
2006-03-22 16:09:12 +08:00
|
|
|
}
|
|
|
|
|
2016-03-18 05:19:26 +08:00
|
|
|
if (!page_ref_freeze(page, expected_count)) {
|
2008-07-26 10:45:32 +08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
2008-07-26 10:45:30 +08:00
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
2012-01-13 09:19:34 +08:00
|
|
|
/*
|
|
|
|
* In the async migration case of moving a page with buffers, lock the
|
|
|
|
* buffers using trylock before the mapping is moved. If the mapping
|
|
|
|
* were moved first and we then failed to lock the buffers, we could not
|
|
|
|
* move it back because of the elevated page count, and would have to
|
|
|
|
* block waiting on other references to be dropped.
|
|
|
|
*/
|
2012-01-13 09:19:43 +08:00
|
|
|
if (mode == MIGRATE_ASYNC && head &&
|
|
|
|
!buffer_migrate_lock_buffers(head, mode)) {
|
2016-03-18 05:19:26 +08:00
|
|
|
page_ref_unfreeze(page, expected_count);
|
2012-01-13 09:19:34 +08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
2015-11-06 10:50:02 +08:00
|
|
|
* Now we know that no one else is looking at the page:
|
|
|
|
* no turning back from here.
|
2006-03-22 16:09:12 +08:00
|
|
|
*/
|
2015-11-06 10:50:02 +08:00
|
|
|
newpage->index = page->index;
|
|
|
|
newpage->mapping = page->mapping;
|
|
|
|
if (PageSwapBacked(page))
|
|
|
|
SetPageSwapBacked(newpage);
|
|
|
|
|
2006-12-07 12:33:44 +08:00
|
|
|
get_page(newpage); /* add cache reference */
|
2006-03-22 16:09:12 +08:00
|
|
|
if (PageSwapCache(page)) {
|
|
|
|
SetPageSwapCache(newpage);
|
|
|
|
set_page_private(newpage, page_private(page));
|
|
|
|
}
|
|
|
|
|
2015-11-06 10:50:05 +08:00
|
|
|
/* Move dirty while page refs frozen and newpage not yet exposed */
|
|
|
|
dirty = PageDirty(page);
|
|
|
|
if (dirty) {
|
|
|
|
ClearPageDirty(page);
|
|
|
|
SetPageDirty(newpage);
|
|
|
|
}
|
|
|
|
|
2006-12-07 12:33:44 +08:00
|
|
|
radix_tree_replace_slot(pslot, newpage);
|
|
|
|
|
|
|
|
/*
|
2012-01-11 07:07:11 +08:00
|
|
|
* Drop cache reference from old page by unfreezing
|
|
|
|
* to one less reference.
|
2006-12-07 12:33:44 +08:00
|
|
|
* We know this isn't the last reference.
|
|
|
|
*/
|
2016-03-18 05:19:26 +08:00
|
|
|
page_ref_unfreeze(page, expected_count - 1);
|
2006-12-07 12:33:44 +08:00
|
|
|
|
2015-11-06 10:50:05 +08:00
|
|
|
spin_unlock(&mapping->tree_lock);
|
|
|
|
/* Leave irq disabled to prevent preemption while updating stats */
|
|
|
|
|
2007-04-24 05:41:09 +08:00
|
|
|
/*
|
|
|
|
* If moved to a different zone then also account
|
|
|
|
* the page for that zone. Other VM counters will be
|
|
|
|
* taken care of when we establish references to the
|
|
|
|
* new page and drop references to the old page.
|
|
|
|
*
|
|
|
|
* Note that anonymous pages are accounted for
|
|
|
|
* via NR_FILE_PAGES and NR_ANON_PAGES if they
|
|
|
|
* are mapped to swap space.
|
|
|
|
*/
|
2015-11-06 10:50:05 +08:00
|
|
|
if (newzone != oldzone) {
|
|
|
|
__dec_zone_state(oldzone, NR_FILE_PAGES);
|
|
|
|
__inc_zone_state(newzone, NR_FILE_PAGES);
|
|
|
|
if (PageSwapBacked(page) && !PageSwapCache(page)) {
|
|
|
|
__dec_zone_state(oldzone, NR_SHMEM);
|
|
|
|
__inc_zone_state(newzone, NR_SHMEM);
|
|
|
|
}
|
|
|
|
if (dirty && mapping_cap_account_dirty(mapping)) {
|
|
|
|
__dec_zone_state(oldzone, NR_FILE_DIRTY);
|
|
|
|
__inc_zone_state(newzone, NR_FILE_DIRTY);
|
|
|
|
}
|
2009-09-22 08:01:33 +08:00
|
|
|
}
|
2015-11-06 10:50:05 +08:00
|
|
|
local_irq_enable();
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
return MIGRATEPAGE_SUCCESS;
|
2006-03-22 16:09:12 +08:00
|
|
|
}
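
The reference accounting in migrate_page_move_mapping() can be sketched in userspace as an expected-count calculation followed by a compare-and-swap to zero. The sketch_ names are hypothetical and C11 atomics stand in for the kernel's page_ref_freeze() and page_ref_unfreeze(); this illustrates the idea and is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_page {
	atomic_int refcount;
	bool has_mapping;	/* page sits in a page cache mapping */
	bool has_private;	/* PagePrivate/PagePrivate2 analogue */
};

/* 1 for the caller, +1 for the mapping, +1 more for private data. */
static int sketch_expected_count(const struct sketch_page *p, int extra_count)
{
	int expected = 1 + extra_count;

	if (p->has_mapping)
		expected += 1 + (p->has_private ? 1 : 0);
	return expected;
}

/* Succeeds only if the count is exactly 'expected'; the count becomes 0. */
static bool sketch_ref_freeze(struct sketch_page *p, int expected)
{
	int want = expected;

	assert(expected > 0);
	return atomic_compare_exchange_strong(&p->refcount, &want, 0);
}

/* Publish a new count after the frozen section is done. */
static void sketch_ref_unfreeze(struct sketch_page *p, int count)
{
	atomic_store(&p->refcount, count);
}
```

Unfreezing to expected - 1, as the function above does, both ends the frozen section and drops the page cache reference that has moved to the new page.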
|
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
/*
|
|
|
|
* The expected number of remaining references is the same as that
|
|
|
|
* of migrate_page_move_mapping().
|
|
|
|
*/
|
|
|
|
int migrate_huge_page_move_mapping(struct address_space *mapping,
|
|
|
|
struct page *newpage, struct page *page)
|
|
|
|
{
|
|
|
|
int expected_count;
|
|
|
|
void **pslot;
|
|
|
|
|
|
|
|
spin_lock_irq(&mapping->tree_lock);
|
|
|
|
|
|
|
|
pslot = radix_tree_lookup_slot(&mapping->page_tree,
|
|
|
|
page_index(page));
|
|
|
|
|
|
|
|
expected_count = 2 + page_has_private(page);
|
|
|
|
if (page_count(page) != expected_count ||
|
2011-01-14 07:47:21 +08:00
|
|
|
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
|
2010-09-08 09:19:35 +08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
2016-03-18 05:19:26 +08:00
|
|
|
if (!page_ref_freeze(page, expected_count)) {
|
2010-09-08 09:19:35 +08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
2015-11-06 10:50:02 +08:00
|
|
|
newpage->index = page->index;
|
|
|
|
newpage->mapping = page->mapping;
|
2016-03-16 05:57:19 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
get_page(newpage);
|
|
|
|
|
|
|
|
radix_tree_replace_slot(pslot, newpage);
|
|
|
|
|
2016-03-18 05:19:26 +08:00
|
|
|
page_ref_unfreeze(page, expected_count - 1);
|
2010-09-08 09:19:35 +08:00
|
|
|
|
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
2016-03-16 05:57:19 +08:00
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
return MIGRATEPAGE_SUCCESS;
|
2010-09-08 09:19:35 +08:00
|
|
|
}
|
|
|
|
|
2013-11-22 06:31:58 +08:00
|
|
|
/*
|
|
|
|
* Gigantic pages are so large that we do not guarantee that page++ pointer
|
|
|
|
* arithmetic will work across the entire page. We need something more
|
|
|
|
* specialized.
|
|
|
|
*/
|
|
|
|
static void __copy_gigantic_page(struct page *dst, struct page *src,
|
|
|
|
int nr_pages)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct page *dst_base = dst;
|
|
|
|
struct page *src_base = src;
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; ) {
|
|
|
|
cond_resched();
|
|
|
|
copy_highpage(dst, src);
|
|
|
|
|
|
|
|
i++;
|
|
|
|
dst = mem_map_next(dst, dst_base, i);
|
|
|
|
src = mem_map_next(src, src_base, i);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void copy_huge_page(struct page *dst, struct page *src)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
int nr_pages;
|
|
|
|
|
|
|
|
if (PageHuge(src)) {
|
|
|
|
/* hugetlbfs page */
|
|
|
|
struct hstate *h = page_hstate(src);
|
|
|
|
nr_pages = pages_per_huge_page(h);
|
|
|
|
|
|
|
|
if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
|
|
|
|
__copy_gigantic_page(dst, src, nr_pages);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* thp page */
|
|
|
|
BUG_ON(!PageTransHuge(src));
|
|
|
|
nr_pages = hpage_nr_pages(src);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
cond_resched();
|
|
|
|
copy_highpage(dst + i, src + i);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
|
|
|
* Copy the page to its new location
|
|
|
|
*/
|
2010-09-08 09:19:35 +08:00
|
|
|
void migrate_page_copy(struct page *newpage, struct page *page)
|
2006-03-22 16:09:12 +08:00
|
|
|
{
|
2013-10-07 18:29:23 +08:00
|
|
|
int cpupid;
|
|
|
|
|
2012-11-19 20:35:47 +08:00
|
|
|
if (PageHuge(page) || PageTransHuge(page))
|
2010-09-08 09:19:35 +08:00
|
|
|
copy_huge_page(newpage, page);
|
|
|
|
else
|
|
|
|
copy_highpage(newpage, page);
|
2006-03-22 16:09:12 +08:00
|
|
|
|
|
|
|
if (PageError(page))
|
|
|
|
SetPageError(newpage);
|
|
|
|
if (PageReferenced(page))
|
|
|
|
SetPageReferenced(newpage);
|
|
|
|
if (PageUptodate(page))
|
|
|
|
SetPageUptodate(newpage);
|
Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
|
|
|
if (TestClearPageActive(page)) {
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(PageUnevictable(page), page);
|
2006-03-22 16:09:12 +08:00
|
|
|
SetPageActive(newpage);
|
2009-12-15 09:59:54 +08:00
|
|
|
} else if (TestClearPageUnevictable(page))
|
|
|
|
SetPageUnevictable(newpage);
|
2006-03-22 16:09:12 +08:00
|
|
|
if (PageChecked(page))
|
|
|
|
SetPageChecked(newpage);
|
|
|
|
if (PageMappedToDisk(page))
|
|
|
|
SetPageMappedToDisk(newpage);
|
|
|
|
|
2015-11-06 10:50:05 +08:00
|
|
|
/* Move dirty on pages not done by migrate_page_move_mapping() */
|
|
|
|
if (PageDirty(page))
|
|
|
|
SetPageDirty(newpage);
|
2006-03-22 16:09:12 +08:00
|
|
|
|
mm: introduce idle page tracking
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 06:35:45 +08:00
|
|
|
if (page_is_young(page))
|
|
|
|
set_page_young(newpage);
|
|
|
|
if (page_is_idle(page))
|
|
|
|
set_page_idle(newpage);
|
|
|
|
|
2013-10-07 18:29:23 +08:00
|
|
|
/*
|
|
|
|
* Copy NUMA information to the new page, to prevent over-eager
|
|
|
|
* future migrations of this same page.
|
|
|
|
*/
|
|
|
|
cpupid = page_cpupid_xchg_last(page, -1);
|
|
|
|
page_cpupid_xchg_last(newpage, cpupid);
|
|
|
|
|
ksm: rmap_walk to remove_migation_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:31 +08:00
|
|
|
ksm_migrate_page(newpage, page);
|
ksm: make KSM page migration possible
KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.
But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree? And what if
that collides with an existing stable node?
So far there's no provision for that, and this patch does not attempt to
deal with it either. But how will I test a solution, when I don't know
how to hotremove memory? The best answer is to enable KSM page migration
in all cases now, and test more common cases. With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.
Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).
It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.
Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.
The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here. That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.
(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)
You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 08:35:10 +08:00
|
|
|
/*
|
|
|
|
* Please do not reorder this without considering how mm/ksm.c's
|
|
|
|
* get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
|
|
|
|
*/
|
2015-04-16 07:13:15 +08:00
|
|
|
if (PageSwapCache(page))
|
|
|
|
ClearPageSwapCache(page);
|
2006-03-22 16:09:12 +08:00
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(page, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If any waiters have accumulated on the new page then
|
|
|
|
* wake them up.
|
|
|
|
*/
|
|
|
|
if (PageWriteback(newpage))
|
|
|
|
end_page_writeback(newpage);
|
2016-03-16 05:56:15 +08:00
|
|
|
|
|
|
|
copy_page_owner(page, newpage);
|
2016-03-16 05:57:54 +08:00
|
|
|
|
|
|
|
mem_cgroup_migrate(page, newpage);
|
2006-03-22 16:09:12 +08:00
|
|
|
}
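The state transfer performed at the end of the function above (dirty, young and idle flags, plus the NUMA cpupid hint) can be sketched in user space as follows. This is a simplified illustration, not the kernel code: `struct model_page`, `model_cpupid_xchg_last()` and `model_copy_state()` are hypothetical names, and plain booleans replace the real page flag accessors.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few pieces of page state shown above. */
struct model_page {
	bool dirty;
	bool young;
	bool idle;
	int last_cpupid;	/* -1 means "no NUMA hint recorded" in this model */
};

/* Exchange semantics: store the new value and return the previous one. */
static int model_cpupid_xchg_last(struct model_page *p, int cpupid)
{
	int old = p->last_cpupid;

	p->last_cpupid = cpupid;
	return old;
}

/* Carry the old page's state over to the new page. */
static void model_copy_state(struct model_page *newpage,
			     struct model_page *page)
{
	int cpupid;

	assert(newpage != page);
	if (page->dirty)
		newpage->dirty = true;
	if (page->young)
		newpage->young = true;
	if (page->idle)
		newpage->idle = true;

	/* Take the hint from the old page, resetting it there to -1. */
	cpupid = model_cpupid_xchg_last(page, -1);
	model_cpupid_xchg_last(newpage, cpupid);
}
```

The exchange leaves the old page without a hint and hands the previous value to the new page in one step, mirroring the two `page_cpupid_xchg_last()` calls in the listing.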
|
|
|
|
|
2006-06-23 17:03:28 +08:00
|
|
|
/************************************************************
|
|
|
|
* Migration functions
|
|
|
|
***********************************************************/
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
|
|
|
* Common logic to directly migrate a single page suitable for
|
2009-04-03 23:42:36 +08:00
|
|
|
* pages that do not use PagePrivate/PagePrivate2.
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
|
|
|
* Pages are locked upon entry and exit.
|
|
|
|
*/
|
2006-06-23 17:03:33 +08:00
|
|
|
int migrate_page(struct address_space *mapping,
|
2012-01-13 09:19:43 +08:00
|
|
|
struct page *newpage, struct page *page,
|
|
|
|
enum migrate_mode mode)
|
2006-03-22 16:09:12 +08:00
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
|
|
|
|
|
2013-12-22 06:56:08 +08:00
|
|
|
rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
if (rc != MIGRATEPAGE_SUCCESS)
|
2006-03-22 16:09:12 +08:00
|
|
|
return rc;
|
|
|
|
|
|
|
|
migrate_page_copy(newpage, page);
|
2012-12-12 08:02:31 +08:00
|
|
|
return MIGRATEPAGE_SUCCESS;
|
2006-03-22 16:09:12 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(migrate_page);
|
|
|
|
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 02:45:40 +08:00
|
|
|
#ifdef CONFIG_BLOCK
|
2006-06-23 17:03:28 +08:00
|
|
|
/*
|
|
|
|
* Migration function for pages with buffers. This function can only be used
|
|
|
|
* if the underlying filesystem guarantees that no other references to "page"
|
|
|
|
* exist.
|
|
|
|
*/
|
2006-06-23 17:03:33 +08:00
|
|
|
int buffer_migrate_page(struct address_space *mapping,
|
2012-01-13 09:19:43 +08:00
|
|
|
struct page *newpage, struct page *page, enum migrate_mode mode)
|
2006-06-23 17:03:28 +08:00
|
|
|
{
|
|
|
|
struct buffer_head *bh, *head;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (!page_has_buffers(page))
|
2012-01-13 09:19:43 +08:00
|
|
|
return migrate_page(mapping, newpage, page, mode);
|
2006-06-23 17:03:28 +08:00
|
|
|
|
|
|
|
head = page_buffers(page);
|
|
|
|
|
2013-12-22 06:56:08 +08:00
|
|
|
rc = migrate_page_move_mapping(mapping, newpage, page, head, mode, 0);
|
2006-06-23 17:03:28 +08:00
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
if (rc != MIGRATEPAGE_SUCCESS)
|
2006-06-23 17:03:28 +08:00
|
|
|
return rc;
|
|
|
|
|
2012-01-13 09:19:34 +08:00
|
|
|
/*
|
|
|
|
* In the async case, migrate_page_move_mapping locked the buffers
|
|
|
|
* with an IRQ-safe spinlock held. In the sync case, the buffers
|
|
|
|
* need to be locked now.
|
|
|
|
*/
|
2012-01-13 09:19:43 +08:00
|
|
|
if (mode != MIGRATE_ASYNC)
|
|
|
|
BUG_ON(!buffer_migrate_lock_buffers(head, mode));
|
2006-06-23 17:03:28 +08:00
|
|
|
|
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(newpage, page_private(page));
|
|
|
|
set_page_private(page, 0);
|
|
|
|
put_page(page);
|
|
|
|
get_page(newpage);
|
|
|
|
|
|
|
|
bh = head;
|
|
|
|
do {
|
|
|
|
set_bh_page(bh, newpage, bh_offset(bh));
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
|
|
|
|
} while (bh != head);
|
|
|
|
|
|
|
|
SetPagePrivate(newpage);
|
|
|
|
|
|
|
|
migrate_page_copy(newpage, page);
|
|
|
|
|
|
|
|
bh = head;
|
|
|
|
do {
|
|
|
|
unlock_buffer(bh);
|
|
|
|
put_bh(bh);
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
|
|
|
|
} while (bh != head);
|
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
return MIGRATEPAGE_SUCCESS;
|
2006-06-23 17:03:28 +08:00
|
|
|
}
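The two `do { ... } while (bh != head)` loops above walk a circular, singly linked ring of buffer heads. A minimal user-space sketch of that traversal pattern is given below. The names `struct model_buf` and `model_retarget_ring()` are hypothetical, and an integer `owner` field merely plays the role of the page a buffer belongs to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring element; `next` plays the role of b_this_page. */
struct model_buf {
	struct model_buf *next;
	int owner;		/* plays the role of the owning page */
};

/*
 * Visit every element of the ring exactly once, starting at head, and
 * point it at the new owner. Returns the number of elements visited.
 * A do/while is used because the ring always has at least one element
 * and the termination test only makes sense after the first step.
 */
static size_t model_retarget_ring(struct model_buf *head, int new_owner)
{
	struct model_buf *b = head;
	size_t visited = 0;

	assert(head != NULL);
	do {
		b->owner = new_owner;
		visited++;
		b = b->next;
	} while (b != head);

	return visited;
}
```

A ring with a single element that points at itself is handled by the same loop: the body runs once and the walk ends when `b` comes back to `head`.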
|
|
|
|
EXPORT_SYMBOL(buffer_migrate_page);
|
2006-10-01 02:45:40 +08:00
|
|
|
#endif
|
2006-06-23 17:03:28 +08:00
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
/*
|
|
|
|
* Write back a page to clean the dirty state
|
|
|
|
*/
|
|
|
|
static int writeout(struct address_space *mapping, struct page *page)
|
2006-06-23 17:03:33 +08:00
|
|
|
{
|
2006-06-23 17:03:38 +08:00
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = WB_SYNC_NONE,
|
|
|
|
.nr_to_write = 1,
|
|
|
|
.range_start = 0,
|
|
|
|
.range_end = LLONG_MAX,
|
|
|
|
.for_reclaim = 1
|
|
|
|
};
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (!mapping->a_ops->writepage)
|
|
|
|
/* No write method for the address space */
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (!clear_page_dirty_for_io(page))
|
|
|
|
/* Someone else already triggered a write */
|
|
|
|
return -EAGAIN;
|
|
|
|
|
2006-06-23 17:03:33 +08:00
|
|
|
/*
|
2006-06-23 17:03:38 +08:00
|
|
|
* A dirty page may imply that the underlying filesystem has
|
|
|
|
* the page on some queue. So the page must be clean for
|
|
|
|
* migration. Writeout may mean we lose the lock and the
|
|
|
|
* page state is no longer what we checked for earlier.
|
|
|
|
* At this point we know that the migration attempt cannot
|
|
|
|
* be successful.
|
2006-06-23 17:03:33 +08:00
|
|
|
*/
|
2016-03-18 05:20:07 +08:00
|
|
|
remove_migration_ptes(page, page, false);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
rc = mapping->a_ops->writepage(page, &wbc);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
if (rc != AOP_WRITEPAGE_ACTIVATE)
|
|
|
|
/* unlocked. Relock */
|
|
|
|
lock_page(page);
|
|
|
|
|
2008-11-20 07:36:36 +08:00
|
|
|
return (rc < 0) ? -EIO : -EAGAIN;
|
2006-06-23 17:03:38 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Default handling if a filesystem does not provide a migration function.
|
|
|
|
*/
|
|
|
|
static int fallback_migrate_page(struct address_space *mapping,
|
2012-01-13 09:19:43 +08:00
|
|
|
struct page *newpage, struct page *page, enum migrate_mode mode)
|
2006-06-23 17:03:38 +08:00
|
|
|
{
|
2012-01-13 09:19:34 +08:00
|
|
|
if (PageDirty(page)) {
|
2012-01-13 09:19:43 +08:00
|
|
|
/* Only writeback pages in full synchronous migration */
|
|
|
|
if (mode != MIGRATE_SYNC)
|
2012-01-13 09:19:34 +08:00
|
|
|
return -EBUSY;
|
2006-06-23 17:03:38 +08:00
|
|
|
return writeout(mapping, page);
|
2012-01-13 09:19:34 +08:00
|
|
|
}
|
2006-06-23 17:03:33 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Buffers may be managed in a filesystem specific way.
|
|
|
|
* We must have no buffers or drop them.
|
|
|
|
*/
|
2009-04-03 23:42:36 +08:00
|
|
|
if (page_has_private(page) &&
|
2006-06-23 17:03:33 +08:00
|
|
|
!try_to_release_page(page, GFP_KERNEL))
|
|
|
|
return -EAGAIN;
|
|
|
|
|
2012-01-13 09:19:43 +08:00
|
|
|
return migrate_page(mapping, newpage, page, mode);
|
2006-06-23 17:03:33 +08:00
|
|
|
}
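The order of checks in the fallback path above can be summarised as a small decision function. The following is a user-space sketch only: `enum model_mode`, `enum model_action`, `struct model_page` and `model_fallback()` are hypothetical names, and the boolean `release_succeeds` is an assumption standing in for the outcome of `try_to_release_page()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three migration modes used above. */
enum model_mode { MODEL_ASYNC, MODEL_SYNC_LIGHT, MODEL_SYNC };

/* What the fallback path decides to do with the page. */
enum model_action {
	MODEL_FAIL_BUSY,	/* dirty page, but not full sync mode */
	MODEL_WRITEOUT,		/* dirty page in full sync mode */
	MODEL_FAIL_AGAIN,	/* private data could not be released */
	MODEL_MIGRATE		/* fall through to the generic migration */
};

struct model_page {
	bool dirty;
	bool has_private;
	bool release_succeeds;	/* assumed outcome of releasing private data */
};

static enum model_action model_fallback(const struct model_page *p,
					enum model_mode mode)
{
	assert(p != NULL);

	/* Dirty pages are only written out in full synchronous mode. */
	if (p->dirty)
		return mode != MODEL_SYNC ? MODEL_FAIL_BUSY : MODEL_WRITEOUT;

	/* Private data must be absent or successfully dropped. */
	if (p->has_private && !p->release_succeeds)
		return MODEL_FAIL_AGAIN;

	return MODEL_MIGRATE;
}
```

Note that a dirty page never reaches the generic migration in a single pass in this model, which matches the listing: `writeout()` only ever returns an error code so that the caller retries later.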
|
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
/*
|
|
|
|
* Move a page to a newly allocated page
|
|
|
|
* The page is locked and all ptes have been successfully removed.
|
|
|
|
*
|
|
|
|
* The new page will have replaced the old page if this function
|
|
|
|
* is successful.
|
Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
|
|
|
*
|
|
|
|
* Return value:
|
|
|
|
* < 0 - error code
|
2012-12-12 08:02:31 +08:00
|
|
|
* MIGRATEPAGE_SUCCESS - success
|
2006-06-23 17:03:51 +08:00
|
|
|
*/
|
2010-05-25 05:32:20 +08:00
|
|
|
static int move_to_new_page(struct page *newpage, struct page *page,
|
2015-11-06 10:49:53 +08:00
|
|
|
enum migrate_mode mode)
|
2006-06-23 17:03:51 +08:00
|
|
|
{
|
|
|
|
struct address_space *mapping;
|
|
|
|
int rc;
|
|
|
|
|
2015-11-06 10:49:49 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
|
|
|
VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
|
2006-06-23 17:03:51 +08:00
|
|
|
|
|
|
|
mapping = page_mapping(page);
|
|
|
|
if (!mapping)
|
2012-01-13 09:19:43 +08:00
|
|
|
rc = migrate_page(mapping, newpage, page, mode);
|
2012-01-13 09:19:34 +08:00
|
|
|
else if (mapping->a_ops->migratepage)
|
2006-06-23 17:03:51 +08:00
|
|
|
/*
|
2012-01-13 09:19:34 +08:00
|
|
|
* Most pages have a mapping and most filesystems provide a
|
|
|
|
* migratepage callback. Anonymous pages are part of swap
|
|
|
|
* space which also has its own migratepage callback. This
|
|
|
|
* is the most common path for page migration.
|
2006-06-23 17:03:51 +08:00
|
|
|
*/
|
2015-11-06 10:49:53 +08:00
|
|
|
rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
|
2012-01-13 09:19:34 +08:00
|
|
|
else
|
2012-01-13 09:19:43 +08:00
|
|
|
rc = fallback_migrate_page(mapping, newpage, page, mode);
|
2006-06-23 17:03:51 +08:00
|
|
|
|
2015-11-06 10:49:53 +08:00
|
|
|
/*
|
|
|
|
* When successful, old pagecache page->mapping must be cleared before
|
|
|
|
* page is freed; but stats require that PageAnon be left as PageAnon.
|
|
|
|
*/
|
|
|
|
if (rc == MIGRATEPAGE_SUCCESS) {
|
|
|
|
if (!PageAnon(page))
|
|
|
|
page->mapping = NULL;
|
2010-05-25 05:32:20 +08:00
|
|
|
}
|
2006-06-23 17:03:51 +08:00
|
|
|
return rc;
|
|
|
|
}
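The three-way dispatch in the function above (no mapping, a mapping with its own callback, a mapping without one) follows a common pattern for optional operations in a function-pointer table. A minimal user-space sketch, assuming hypothetical names `struct model_ops`, `struct model_mapping`, `model_fs_migratepage()` and `model_pick_path()`, might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Which of the three paths would be taken. */
enum model_path {
	MODEL_PATH_GENERIC,	/* no mapping: generic migration */
	MODEL_PATH_CALLBACK,	/* mapping supplies its own callback */
	MODEL_PATH_FALLBACK	/* mapping exists but has no callback */
};

/* Hypothetical operations table with one optional callback. */
struct model_ops {
	int (*migratepage)(void);
};

struct model_mapping {
	const struct model_ops *ops;
};

/* Example callback a filesystem-like owner could install. */
static int model_fs_migratepage(void)
{
	return 0;
}

static enum model_path model_pick_path(const struct model_mapping *mapping)
{
	if (!mapping)
		return MODEL_PATH_GENERIC;

	assert(mapping->ops != NULL);
	if (mapping->ops->migratepage)
		return MODEL_PATH_CALLBACK;

	return MODEL_PATH_FALLBACK;
}
```

The sketch only reports which path is chosen. In the listing each branch additionally produces a return code, and on success the old page's mapping is cleared unless the page is anonymous.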
|
|
|
|
|
2011-11-01 08:06:57 +08:00
|
|
|
static int __unmap_and_move(struct page *page, struct page *newpage,
|
2013-02-23 08:35:14 +08:00
|
|
|
int force, enum migrate_mode mode)
|
2006-06-23 17:03:51 +08:00
|
|
|
{
|
2011-11-01 08:06:57 +08:00
|
|
|
int rc = -EAGAIN;
|
2014-12-13 08:56:19 +08:00
|
|
|
int page_was_mapped = 0;
|
mm: migration: take a reference to the anon_vma before migrating
This patchset is a memory compaction mechanism that reduces external
memory fragmentation by moving GFP_MOVABLE pages to a smaller number of
pageblocks. The term "compaction" was chosen as there are a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.
In this implementation, a full compaction run involves two scanners
operating within a zone - a migration and a free scanner. The migration
scanner starts at the beginning of a zone and finds all movable pages
within one pageblock_nr_pages-sized area and isolates them on a
migratepages list. The free scanner begins at the end of the zone and
searches on a per-area basis for enough free pages to migrate all the
pages on the migratepages list. As each area is respectively migrated or
exhausted of free pages, the scanners are advanced one area. A compaction
run completes within a zone when the two scanners meet.
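The two-scanner scheme described above can be sketched with a toy user-space model. It is deliberately simplified and rests on stated assumptions: it works on individual page slots rather than pageblock-sized areas, it moves one page at a time instead of batching pages on a migratepages list, and `enum model_slot` and `model_compact()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum model_slot { MODEL_FREE, MODEL_MOVABLE, MODEL_UNMOVABLE };

/*
 * Toy compaction of a zone modelled as an array of page slots.
 * The migration scanner walks up from index 0 looking for movable pages;
 * the free scanner walks down from the end looking for free slots.
 * Each movable page found is moved into the highest free slot that lies
 * above it. The run ends when the two scanners meet.
 * Returns the number of pages moved.
 */
static size_t model_compact(enum model_slot *zone, size_t n)
{
	size_t migrate = 0;	/* next slot the migration scanner inspects */
	size_t free_scan = n;	/* one past the slot the free scanner inspects */
	size_t moved = 0;

	assert(zone != NULL || n == 0);

	while (migrate < free_scan) {
		if (zone[migrate] != MODEL_MOVABLE) {
			migrate++;
			continue;
		}

		/* Advance the free scanner down to a free slot above us. */
		while (free_scan > migrate + 1 &&
		       zone[free_scan - 1] != MODEL_FREE)
			free_scan--;

		if (free_scan <= migrate + 1)
			break;	/* scanners met: no free slot left above */

		zone[free_scan - 1] = MODEL_MOVABLE;
		zone[migrate] = MODEL_FREE;
		free_scan--;
		migrate++;
		moved++;
	}
	return moved;
}
```

For a zone laid out as movable, free, unmovable, movable, free, free, the model moves the two movable pages to the top two slots and leaves the unmovable page in place, so the free slots end up grouped at the low end of the zone around it.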
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory
and compacting all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact where N is the
node ID to be compacted. When a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation
instead of entering direct reclaim. Explicit compaction does not finish
until the two scanners meet and direct compaction ends if a suitable page
becomes available that would meet watermarks.
The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.
Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports a "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trigger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.
Testing of compaction was in three stages. For the test, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.
The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax
The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
a) Start updatedb
b) Create in parallel X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running
At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size
X86                      vanilla   compaction
Final page count             913          916 (attempted 1002)
pages reclaimed            68296         9791
X86-64                   vanilla   compaction
Final page count:            901          902 (attempted 1002)
Total pages reclaimed:    112599        53234
PPC64                    vanilla   compaction
Final page count:             93           94 (attempted 110)
Total pages reclaimed:    103216        61838
There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that fewer pages were
reclaimed in all cases reducing the amount of IO required to satisfy a
huge page allocation.
The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.
The last test was a high-order allocation stress test. Many kernel
compiles are started to fill memory with a pressured mix of unmovable and
movable allocations. During this, an attempt is made to allocate 90% of
memory as huge pages - one at a time with small delays between attempts to
avoid flooding the IO queue.
                                         vanilla   compaction
Percentage of request allocated X86           98           99
Percentage of request allocated X86-64        95           98
Percentage of request allocated PPC64         55           70
This patch:
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 05:32:17 +08:00
|
|
|
struct anon_vma *anon_vma = NULL;
|
2006-06-23 17:03:53 +08:00
|
|
|
|
2008-08-02 18:01:03 +08:00
|
|
|
if (!trylock_page(page)) {
|
2012-01-13 09:19:43 +08:00
|
|
|
if (!force || mode == MIGRATE_ASYNC)
|
2011-11-01 08:06:57 +08:00
|
|
|
goto out;
|
2011-01-14 07:45:56 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It's not safe for direct compaction to call lock_page.
|
|
|
|
* For example, during page readahead pages are added locked
|
|
|
|
* to the LRU. Later, when the IO completes the pages are
|
|
|
|
* marked uptodate and unlocked. However, the queueing
|
|
|
|
* could be merging multiple pages for one bio (e.g.
|
|
|
|
* mpage_readpages). If an allocation happens for the
|
|
|
|
* second or third page, the process can end up locking
|
|
|
|
* the same page twice and deadlocking. Rather than
|
|
|
|
* trying to be clever about what pages can be locked,
|
|
|
|
* avoid the use of lock_page for direct compaction
|
|
|
|
* altogether.
|
|
|
|
*/
|
|
|
|
if (current->flags & PF_MEMALLOC)
|
2011-11-01 08:06:57 +08:00
|
|
|
goto out;
|
2011-01-14 07:45:56 +08:00
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
lock_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (PageWriteback(page)) {
|
2011-03-23 07:33:11 +08:00
|
|
|
/*
|
2013-04-30 06:07:58 +08:00
|
|
|
* Only in the case of a full synchronous migration is it
|
2012-01-13 09:19:43 +08:00
|
|
|
* necessary to wait for PageWriteback. In the async case,
|
|
|
|
* the retry loop is too short and in the sync-light case,
|
|
|
|
* the overhead of stalling is too much
|
2011-03-23 07:33:11 +08:00
|
|
|
*/
|
2012-01-13 09:19:43 +08:00
|
|
|
if (mode != MIGRATE_SYNC) {
|
2011-03-23 07:33:11 +08:00
|
|
|
rc = -EBUSY;
|
mm: memcontrol: rewrite uncharge API
The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.
Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages. However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:
- Charging, uncharging, page migration, and charge migration all need
to take a per-page bit spinlock as they could race with uncharging.
- Swap cache truncation happens during both swap-in and swap-out, and
possibly repeatedly before the page is actually freed. This means
that the memcg swapout code is called from many contexts that make
no sense and it has to figure out the direction from page state to
make sure memory and memory+swap are always correctly charged.
- On page migration, the old page might be unmapped but then reused,
so memcg code has to prevent untimely uncharging in that case.
Because this code - which should be a simple charge transfer - is so
special-cased, it is not reusable for replace_page_cache().
But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.
For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped. Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge. The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.
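A minimal sketch of the charge transfer described above, under loud assumptions: the bit values, struct toy_page_cgroup and both toy_* functions are hypothetical, only the PCG_* names come from the text, and the target page is assumed to be uncharged (the fuse splicing case mentioned below is not modelled).

```c
#include <assert.h>

/* Hypothetical bit values; only the names come from the commit message. */
#define PCG_USED   0x1u  /* page has a memcg association */
#define PCG_MEM    0x2u  /* page holds a memory charge */
#define PCG_MEMSW  0x4u  /* page holds a memory+swap charge */

struct toy_page_cgroup {
    unsigned int flags;
    int memcg_id;  /* stand-in for pc->mem_cgroup */
};

/* Move the charge bits from the old page to the new one; no counters change. */
static void toy_memcg_migrate(struct toy_page_cgroup *oldpc,
                              struct toy_page_cgroup *newpc)
{
    unsigned int charge = oldpc->flags & (PCG_MEM | PCG_MEMSW);

    assert(!(newpc->flags & PCG_USED));  /* sketch assumes an uncharged target */
    newpc->memcg_id = oldpc->memcg_id;
    newpc->flags = PCG_USED | charge;
    oldpc->flags &= ~(PCG_MEM | PCG_MEMSW);  /* association stays, charge moves */
}

/* Uncharge after the final put: returns how many charges were dropped. */
static int toy_memcg_uncharge(struct toy_page_cgroup *pc)
{
    int dropped = 0;

    if (pc->flags & PCG_MEM)
        dropped++;
    if (pc->flags & PCG_MEMSW)
        dropped++;
    pc->flags = 0;
    return dropped;
}
```

Because the old page keeps its association but no longer holds a charge, uncharging it later drops nothing, which is how the double charge is avoided in this model.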
mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache(). However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration. Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.
Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.
Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration. Remove the very costly page_cgroup
lock and set pc->flags non-atomically.
[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:19:22 +08:00
|
|
|
goto out_unlock;
|
2011-03-23 07:33:11 +08:00
|
|
|
}
|
|
|
|
if (!force)
|
2014-08-09 05:19:22 +08:00
|
|
|
goto out_unlock;
|
2006-06-23 17:03:51 +08:00
|
|
|
wait_on_page_writeback(page);
|
|
|
|
}
|
2015-11-06 10:49:56 +08:00
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
/*
|
2007-07-27 01:41:07 +08:00
|
|
|
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
|
|
|
|
* we cannot notice that anon_vma is freed while we migrate a page.
|
2011-01-14 07:47:30 +08:00
|
|
|
* This get_anon_vma() delays freeing anon_vma pointer until the end
|
2007-07-27 01:41:07 +08:00
|
|
|
* of migration. File cache pages are no problem because of page_lock()
|
2007-08-31 14:56:21 +08:00
|
|
|
* File caches may use write_page() or lock_page() in migration, so
|
|
|
|
* only anon pages need this care here.
|
2015-11-06 10:49:56 +08:00
|
|
|
*
|
|
|
|
* Only page_get_anon_vma() understands the subtleties of
|
|
|
|
* getting a hold on an anon_vma from outside one of its mms.
|
|
|
|
* But if we cannot get anon_vma, then we won't need it anyway,
|
|
|
|
* because that implies that the anon page is no longer mapped
|
|
|
|
* (and cannot be remapped so long as we hold the page lock).
|
2007-07-27 01:41:07 +08:00
|
|
|
*/
|
2015-11-06 10:49:56 +08:00
|
|
|
if (PageAnon(page) && !PageKsm(page))
|
2011-05-25 08:12:10 +08:00
|
|
|
anon_vma = page_get_anon_vma(page);
|
2008-02-05 14:29:33 +08:00
|
|
|
|
2015-11-06 10:49:49 +08:00
|
|
|
/*
|
|
|
|
* Block others from accessing the new page when we get around to
|
|
|
|
* establishing additional references. We are usually the only one
|
|
|
|
* holding a reference to newpage at this point. We used to have a BUG
|
|
|
|
* here if trylock_page(newpage) fails, but would like to allow for
|
|
|
|
* cases where there might be a race with the previous use of newpage.
|
|
|
|
* This is much like races on refcount of oldpage: just don't BUG().
|
|
|
|
*/
|
|
|
|
if (unlikely(!trylock_page(newpage)))
|
|
|
|
goto out_unlock;
|
|
|
|
|
2014-10-10 06:29:27 +08:00
|
|
|
if (unlikely(isolated_balloon_page(page))) {
|
2012-12-12 08:02:42 +08:00
|
|
|
/*
|
|
|
|
* A ballooned page does not need any special attention from
|
|
|
|
* physical to virtual reverse mapping procedures.
|
|
|
|
* Skip any attempt to unmap PTEs or to remap swap cache,
|
|
|
|
* in order to avoid burning cycles at rmap level, and perform
|
|
|
|
* the page migration right away (protected by page lock).
|
|
|
|
*/
|
|
|
|
rc = balloon_page_migrate(newpage, page, mode);
|
2015-11-06 10:49:49 +08:00
|
|
|
goto out_unlock_both;
|
2012-12-12 08:02:42 +08:00
|
|
|
}
|
|
|
|
|
2007-07-27 01:41:07 +08:00
|
|
|
/*
|
2008-02-05 14:29:33 +08:00
|
|
|
* Corner case handling:
|
|
|
|
* 1. When a new swap-cache page is read in, it is added to the LRU
|
|
|
|
* and treated as swapcache but it has no rmap yet.
|
|
|
|
* Calling try_to_unmap() against a page->mapping==NULL page will
|
|
|
|
* trigger a BUG. So handle it here.
|
|
|
|
* 2. An orphaned page (see truncate_complete_page) might have
|
|
|
|
* fs-private metadata. The page can be picked up due to memory
|
|
|
|
* offlining. Everywhere else except page reclaim, the page is
|
|
|
|
* invisible to the vm, so the page can not be migrated. So try to
|
|
|
|
* free the metadata, so the page can be freed.
|
2006-06-23 17:03:51 +08:00
|
|
|
*/
|
2008-02-05 14:29:33 +08:00
|
|
|
if (!page->mapping) {
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(PageAnon(page), page);
|
2011-01-14 07:47:30 +08:00
|
|
|
if (page_has_private(page)) {
|
2008-02-05 14:29:33 +08:00
|
|
|
try_to_free_buffers(page);
|
2015-11-06 10:49:49 +08:00
|
|
|
goto out_unlock_both;
|
2008-02-05 14:29:33 +08:00
|
|
|
}
|
2015-11-06 10:49:49 +08:00
|
|
|
} else if (page_mapped(page)) {
|
|
|
|
/* Establish migration ptes */
|
2015-11-06 10:49:56 +08:00
|
|
|
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
|
|
|
|
page);
|
2014-12-13 08:56:19 +08:00
|
|
|
try_to_unmap(page,
|
mm/hwpoison: fix race between soft_offline_page and unpoison_memory
Wanpeng Li reported a race between soft_offline_page() and
unpoison_memory(), which causes the following kernel panic:
BUG: Bad page state in process bash pfn:97000
page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
flags: 0x1fffff80080048(uptodate|active|swapbacked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x40(active)
Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
Call Trace:
dump_stack+0x48/0x5c
bad_page+0xe6/0x140
free_pages_prepare+0x2f9/0x320
? uncharge_list+0xdd/0x100
free_hot_cold_page+0x40/0x170
__put_single_page+0x20/0x30
put_page+0x25/0x40
unmap_and_move+0x1a6/0x1f0
migrate_pages+0x100/0x1d0
? kill_procs+0x100/0x100
? unlock_page+0x6f/0x90
__soft_offline_page+0x127/0x2a0
soft_offline_page+0xa6/0x200
The race can be explained as follows:
CPU0                    CPU1
soft_offline_page
__soft_offline_page
TestSetPageHWPoison
                        unpoison_memory
                        PageHWPoison check (true)
                        TestClearPageHWPoison
                        put_page -> release refcount held by get_hwpoison_page in unpoison_memory
                        put_page -> release refcount held by isolate_lru_page in __soft_offline_page
migrate_pages
The second put_page() releases the refcount held by isolate_lru_page(),
so unmap_and_move() ends up releasing the last refcount of the page while
its mapcount is still 1, since try_to_unmap() is not called if only one
user maps the page. The page refcount and mapcount are also left
inconsistent if the page is mapped by multiple users.
This race was introduced by commit 4491f71260 ("mm/memory-failure: set
PageHWPoison before migrate_pages()"), which focuses on preventing the
reuse of successfully migrated page. Before this commit we prevent the
reuse by changing the migratetype to MIGRATE_ISOLATE during soft
offlining, which has the following problems, so simply reverting the
commit is not the best option:
1) it doesn't eliminate the reuse completely, because
set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
target page if the pageblock of the page contains one or more
unmovable pages (i.e. has_unmovable_pages() returns true).
2) the original code changes migratetype to MIGRATE_ISOLATE
forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
regardless of the original migratetype state, which could impact
other subsystems like memory hotplug or compaction.
This patch moves SetPageHWPoison to just after put_page() in
unmap_and_move(), which closes the reported race window and minimizes
another race window between SetPageHWPoison and reallocation (which causes
the reuse of a soft-offlined page). The latter race window still exists,
but that is acceptable: it is rare and, even if it happens, effectively
the same as the ordinary "containment failure" case.
Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 06:03:27 +08:00
|
|
|
TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
|
2014-12-13 08:56:19 +08:00
|
|
|
page_was_mapped = 1;
|
|
|
|
}
|
2007-07-27 01:41:07 +08:00
|
|
|
|
2006-06-25 20:46:49 +08:00
|
|
|
if (!page_mapped(page))
|
2015-11-06 10:49:53 +08:00
|
|
|
rc = move_to_new_page(newpage, page, mode);
|
2006-06-23 17:03:51 +08:00
|
|
|
|
2015-11-06 10:49:53 +08:00
|
|
|
if (page_was_mapped)
|
|
|
|
remove_migration_ptes(page,
|
2016-03-18 05:20:07 +08:00
|
|
|
rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
|
2010-05-25 05:32:17 +08:00
|
|
|
|
2015-11-06 10:49:49 +08:00
|
|
|
out_unlock_both:
|
|
|
|
unlock_page(newpage);
|
|
|
|
out_unlock:
|
mm: migration: take a reference to the anon_vma before migrating
This patchset is a memory compaction mechanism that reduces external
memory fragmentation by moving GFP_MOVABLE pages to a smaller number of
pageblocks. The term "compaction" was chosen as there are a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.
In this implementation, a full compaction run involves two scanners
operating within a zone - a migration and a free scanner. The migration
scanner starts at the beginning of a zone and finds all movable pages
within one pageblock_nr_pages-sized area and isolates them on a
migratepages list. The free scanner begins at the end of the zone and
searches on a per-area basis for enough free pages to migrate all the
pages on the migratepages list. As each area is respectively migrated or
exhausted of free pages, the scanners are advanced one area. A compaction
run completes within a zone when the two scanners meet.
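The two-scanner scheme above can be sketched on a toy zone. This is a simplified illustration, not the kernel's algorithm: it works page by page instead of per pageblock-sized area, it does not isolate pages on lists, and toy_compact and the one-character page encoding are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy zone: each character is one page.
 *   'M' = movable page in use, 'U' = unmovable page in use, '.' = free.
 * The migration scanner walks up from the start, the free scanner walks
 * down from the end, and the run stops when they meet.
 * Returns the number of pages moved.
 */
static size_t toy_compact(char *zone)
{
    size_t migrate_pos = 0, free_pos, moved = 0;

    assert(zone != NULL);
    free_pos = strlen(zone);  /* one past the last page */

    while (migrate_pos < free_pos) {
        if (zone[migrate_pos] != 'M') {  /* nothing to migrate here */
            migrate_pos++;
            continue;
        }
        /* Advance the free scanner down to the next free page. */
        while (free_pos > migrate_pos + 1 && zone[free_pos - 1] != '.')
            free_pos--;
        if (free_pos == migrate_pos + 1)
            break;                       /* scanners met */
        zone[free_pos - 1] = 'M';        /* "migrate" the page */
        zone[migrate_pos] = '.';
        free_pos--;
        migrate_pos++;
        moved++;
    }
    return moved;
}
```

On the zone "M.MU..M." this moves two pages and leaves "...U.MMM": free pages collect at the start of the zone, while the unmovable page stays where it was.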
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory
and compacting all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact where N is the
node ID to be compacted. When a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation
instead of entering direct reclaim. Explicit compaction does not finish
until the two scanners meet and direct compaction ends if a suitable page
becomes available that would meet watermarks.
The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.
Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports a "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trigger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.
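As an illustration of the "unusable free space index" from patch 6, here is a small sketch. It assumes the index is the fraction of free memory sitting in blocks too small for the requested order, reported in thousandths; that definition, TOY_MAX_ORDER and toy_unusable_index are assumptions for this example and are not taken from the patch text.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER 11  /* assumed number of buddy orders, 0..10 */

/*
 * nr_free[i] is the number of free blocks of 2^i pages.
 * Returns the index in thousandths: 0 means all free memory can serve
 * the request, 1000 means none of it can.
 */
static unsigned long toy_unusable_index(const unsigned long nr_free[TOY_MAX_ORDER],
                                        unsigned int order)
{
    unsigned long free_pages = 0, usable_pages = 0;
    unsigned int i;

    assert(order < TOY_MAX_ORDER);
    for (i = 0; i < TOY_MAX_ORDER; i++) {
        unsigned long pages = nr_free[i] << i;

        free_pages += pages;
        if (i >= order)  /* block is large enough for the request */
            usable_pages += pages;
    }
    if (free_pages == 0)
        return 1000;  /* no free memory at all: nothing is usable */
    return (free_pages - usable_pages) * 1000 / free_pages;
}
```

With 512 free order-0 pages and one free order-9 block, half of the free memory is unusable for an order-9 request, so the sketch returns 500 for order 9, 0 for order 0 and 1000 for order 10.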
Testing of compaction was in three stages. For the test, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.
The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax
The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
a) Start updatedb
b) Create in parallel X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running
At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size
X86                        vanilla   compaction
Final page count               913          916 (attempted 1002)
pages reclaimed              68296         9791

X86-64                     vanilla   compaction
Final page count:              901          902 (attempted 1002)
Total pages reclaimed:      112599        53234

PPC64                      vanilla   compaction
Final page count:               93           94 (attempted 110)
Total pages reclaimed:      103216        61838
There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that fewer pages were
reclaimed in all cases reducing the amount of IO required to satisfy a
huge page allocation.
The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.
The last test was a high-order allocation stress test. Many kernel
compiles are started to fill memory with a pressured mix of unmovable and
movable allocations. During this, an attempt is made to allocate 90% of
memory as huge pages - one at a time with small delays between attempts to
avoid flooding the IO queue.
                                        vanilla   compaction
Percentage of request allocated X86          98           99
Percentage of request allocated X86-64       95           98
Percentage of request allocated PPC64        55           70
This patch:
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
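The reference-taking approach described in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_anon_vma`, `toy_get()`, `toy_put()` and `toy_migrate_with_pin()` are hypothetical names invented for illustration, the counter is a plain `int` rather than an atomic, and nothing here is the kernel's actual anon_vma code. It shows only why holding an extra reference keeps the object alive while a walker still needs it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct anon_vma: only the refcount matters here. */
struct toy_anon_vma {
    int refcount;      /* number of users that still need the object */
    int *freed_flag;   /* set to 1 when the object is released (demo only) */
};

static struct toy_anon_vma *toy_alloc(int *freed_flag)
{
    struct toy_anon_vma *av = malloc(sizeof(*av));

    if (!av)
        return NULL;
    av->refcount = 1;            /* reference held by the creator */
    av->freed_flag = freed_flag;
    *freed_flag = 0;
    return av;
}

/* Take an extra reference (non-atomic: single-threaded illustration). */
static void toy_get(struct toy_anon_vma *av)
{
    av->refcount++;
}

/* Drop a reference; free the object when the last one goes away. */
static void toy_put(struct toy_anon_vma *av)
{
    if (--av->refcount == 0) {
        *av->freed_flag = 1;
        free(av);
    }
}

/*
 * Returns 1 if the object was still alive at the point where the walk
 * would run and was freed only after the final put; 0 otherwise.
 */
static int toy_migrate_with_pin(void)
{
    int freed;
    int alive_during_walk;
    struct toy_anon_vma *av = toy_alloc(&freed);

    if (!av)
        return 0;
    toy_get(av);                  /* migration pins the object */
    toy_put(av);                  /* the creator's reference goes away */
    alive_during_walk = !freed;   /* the walk would run here */
    toy_put(av);                  /* migration drops its pin: object is freed */
    return alive_during_walk && freed;
}
```

In this model the creator's reference can be dropped in the middle of the "migration" without the object being freed, because the pin taken beforehand is still outstanding.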
2010-05-25 05:32:17 +08:00
|
|
|
/* Drop an anon_vma reference if we took one */
|
2010-08-10 08:18:41 +08:00
|
|
|
if (anon_vma)
|
2011-03-23 07:32:46 +08:00
|
|
|
put_anon_vma(anon_vma);
|
2006-06-23 17:03:51 +08:00
|
|
|
unlock_page(page);
|
2011-11-01 08:06:57 +08:00
|
|
|
out:
|
|
|
|
return rc;
|
|
|
|
}
|
2006-06-23 17:03:53 +08:00
|
|
|
|
2015-04-15 06:44:22 +08:00
|
|
|
/*
|
|
|
|
* gcc 4.7 and 4.8 on arm get an ICE when inlining unmap_and_move(). Work
|
|
|
|
* around it.
|
|
|
|
*/
|
|
|
|
#if (GCC_VERSION >= 40700 && GCC_VERSION < 40900) && defined(CONFIG_ARM)
|
|
|
|
#define ICE_noinline noinline
|
|
|
|
#else
|
|
|
|
#define ICE_noinline
|
|
|
|
#endif
|
|
|
|
|
2011-11-01 08:06:57 +08:00
|
|
|
/*
|
|
|
|
* Obtain the lock on page, remove all ptes and migrate the page
|
|
|
|
* to the newly allocated page in newpage.
|
|
|
|
*/
|
2015-04-15 06:44:22 +08:00
|
|
|
static ICE_noinline int unmap_and_move(new_page_t get_new_page,
|
|
|
|
free_page_t put_new_page,
|
|
|
|
unsigned long private, struct page *page,
|
mm: soft-offline: don't free target page in successful page migration
Stress testing showed that soft offline events for a process iterating
"mmap-pagefault-munmap" loop can trigger
VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
Soft offlining page 0x70fe1 at 0x70100008d000
Soft offlining page 0x705fb at 0x70300008d000
page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
flags: 0x1fffff80800000(hwpoison)
page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
RIP: free_pcppages_bulk+0x52a/0x6f0
Call Trace:
drain_pages_zone+0x3d/0x50
drain_local_pages+0x1d/0x30
on_each_cpu_mask+0x46/0x80
drain_all_pages+0x14b/0x1e0
soft_offline_page+0x432/0x6e0
SyS_madvise+0x73c/0x780
system_call_fastpath+0x12/0x17
Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
RIP [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
RSP <ffff88007a117d28>
---[ end trace 53926436e76d1f35 ]---
When soft offline successfully migrates a page, the source page is supposed
to be freed. But there is a race condition where a source page looks
isolated (i.e. its refcount is 0 and PageHWPoison is set) but is
somehow still linked to a pcplist. Then another soft offline event calls
drain_all_pages() and tries to free such a hwpoisoned page, which is
forbidden.
This odd page state seems to happen due to the race between put_page() in
putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
drop lru_add_drain_all() in __soft_offline_page()", or to change page
freeing code for this soft offline's purpose.
Instead, let's think about the difference between hard offline and soft
offline. There is an interesting difference in how to isolate the in-use
page between these, that is, hard offline marks PageHWPoison of the target
page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
offline tries to free the target page then marks PageHWPoison. This
difference might be the source of complexity and result in bugs like the
above. So making soft offline isolate with keeping refcount can be a
solution for this problem.
We can pass to the page migration code the "reason", which identifies the
caller, so let's use it to avoid calling putback_lru_page() when called from
soft offline, which effectively does the isolation for soft offline. With
this change, target pages of soft offline are never reused without changing
migratetype, so this patch also removes the related code.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 07:56:50 +08:00
|
|
|
int force, enum migrate_mode mode,
|
|
|
|
enum migrate_reason reason)
|
2011-11-01 08:06:57 +08:00
|
|
|
{
|
2015-11-06 10:49:46 +08:00
|
|
|
int rc = MIGRATEPAGE_SUCCESS;
|
2011-11-01 08:06:57 +08:00
|
|
|
int *result = NULL;
|
2015-11-06 10:49:46 +08:00
|
|
|
struct page *newpage;
|
2011-11-01 08:06:57 +08:00
|
|
|
|
2015-11-06 10:49:46 +08:00
|
|
|
newpage = get_new_page(page, private, &result);
|
2011-11-01 08:06:57 +08:00
|
|
|
if (!newpage)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (page_count(page) == 1) {
|
|
|
|
/* page was freed from under us. So we are done. */
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-01-16 08:54:00 +08:00
|
|
|
if (unlikely(PageTransHuge(page))) {
|
|
|
|
lock_page(page);
|
|
|
|
rc = split_huge_page(page);
|
|
|
|
unlock_page(page);
|
|
|
|
if (rc)
|
2011-11-01 08:06:57 +08:00
|
|
|
goto out;
|
2016-01-16 08:54:00 +08:00
|
|
|
}
|
2011-11-01 08:06:57 +08:00
|
|
|
|
2013-02-23 08:35:14 +08:00
|
|
|
rc = __unmap_and_move(page, newpage, force, mode);
|
2016-03-16 05:56:18 +08:00
|
|
|
if (rc == MIGRATEPAGE_SUCCESS) {
|
2015-11-06 10:49:46 +08:00
|
|
|
put_new_page = NULL;
|
2016-03-16 05:56:18 +08:00
|
|
|
set_page_owner_migrate_reason(newpage, reason);
|
|
|
|
}
|
2012-12-12 08:02:42 +08:00
|
|
|
|
2011-11-01 08:06:57 +08:00
|
|
|
out:
|
2006-06-23 17:03:51 +08:00
|
|
|
if (rc != -EAGAIN) {
|
2011-11-01 08:06:57 +08:00
|
|
|
/*
|
|
|
|
* A page that has been migrated has all references
|
|
|
|
* removed and will be freed. A page that has not been
|
|
|
|
* migrated will have kept its references and be
|
|
|
|
* restored.
|
|
|
|
*/
|
|
|
|
list_del(&page->lru);
|
2009-09-22 08:01:37 +08:00
|
|
|
dec_zone_page_state(page, NR_ISOLATED_ANON +
|
2009-09-22 08:02:59 +08:00
|
|
|
page_is_file_cache(page));
|
mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*
The race condition addressed in commit add05cecef80 ("mm: soft-offline:
don't free target page in successful page migration") was not closed
completely, because that can happen not only for soft-offline, but also
for hard-offline. Consider that a slab page is about to be freed into
buddy pool, and then an uncorrected memory error hits the page just
after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
necessary because the data on the affected page is not consumed.
To solve it, this patch drops __PG_HWPOISON from the page flag checks at
allocation/free time. I think it's justified because the __PG_HWPOISON
flag is defined to prevent the page from being reused, and setting it
outside the page's alloc-free cycle is designed behavior (not a bug).
For recent months, I was annoyed by the BUG_ON triggered when a
soft-offlined page remains on the lru cache list for a while, which is
avoided by calling put_page() instead of putback_lru_page() in page
migration's success path. This means that this patch reverts a major
change from commit add05cecef80 about the new refcounting rule of
soft-offlined pages, so the "reuse window" reappears. This will be closed
by a subsequent patch.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dean Nelson <dnelson@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 06:47:08 +08:00
|
|
|
/* Soft-offlined page shouldn't go through lru cache list */
|
2016-04-29 07:18:44 +08:00
|
|
|
if (reason == MR_MEMORY_FAILURE && rc == MIGRATEPAGE_SUCCESS) {
|
|
|
|
/*
|
|
|
|
* With this release, we free successfully migrated
|
|
|
|
* page and set PG_HWPoison on just freed page
|
|
|
|
* intentionally. Although it's rather weird, it's how
|
|
|
|
* HWPoison flag works at the moment.
|
|
|
|
*/
|
2015-08-07 06:47:08 +08:00
|
|
|
put_page(page);
|
mm/hwpoison: fix race between soft_offline_page and unpoison_memory
Wanpeng Li reported a race between soft_offline_page() and
unpoison_memory(), which causes the following kernel panic:
BUG: Bad page state in process bash pfn:97000
page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
flags: 0x1fffff80080048(uptodate|active|swapbacked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x40(active)
Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
Call Trace:
dump_stack+0x48/0x5c
bad_page+0xe6/0x140
free_pages_prepare+0x2f9/0x320
? uncharge_list+0xdd/0x100
free_hot_cold_page+0x40/0x170
__put_single_page+0x20/0x30
put_page+0x25/0x40
unmap_and_move+0x1a6/0x1f0
migrate_pages+0x100/0x1d0
? kill_procs+0x100/0x100
? unlock_page+0x6f/0x90
__soft_offline_page+0x127/0x2a0
soft_offline_page+0xa6/0x200
This race is explained like below:
CPU0                            CPU1
soft_offline_page
__soft_offline_page
TestSetPageHWPoison
                                unpoison_memory
                                PageHWPoison check (true)
                                TestClearPageHWPoison
                                put_page -> release refcount held by get_hwpoison_page in unpoison_memory
                                put_page -> release refcount held by isolate_lru_page in __soft_offline_page
migrate_pages
The second put_page() releases the refcount held by isolate_lru_page(), so
unmap_and_move() ends up releasing the last refcount of the page while its
mapcount is still 1, since try_to_unmap() is not called if only one
user maps the page. The page refcount and mapcount are likewise
inconsistent if the page is mapped by multiple users.
This race was introduced by commit 4491f71260 ("mm/memory-failure: set
PageHWPoison before migrate_pages()"), which focuses on preventing the
reuse of successfully migrated page. Before this commit we prevent the
reuse by changing the migratetype to MIGRATE_ISOLATE during soft
offlining, which has the following problems, so simply reverting the
commit is not the best option:
1) it doesn't eliminate the reuse completely, because
set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
target page if the pageblock of the page contains one or more
unmovable pages (i.e. has_unmovable_pages() returns true).
2) the original code changes migratetype to MIGRATE_ISOLATE
forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
regardless of the original migratetype state, which could impact
other subsystems like memory hotplug or compaction.
This patch moves SetPageHWPoison to just after put_page() in
unmap_and_move(), which closes the reported race window and minimizes
another race window between SetPageHWPoison and reallocation (which causes
the reuse of a soft-offlined page). The latter race window still exists,
but that is acceptable because it is rare and, even if it happens,
effectively the same as the ordinary "containment failure" case.
Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
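The way the source page is disposed of after a migration attempt, as described above and as it appears in the `out:` path of unmap_and_move() in this file, can be summarized as a small decision function. This is a simplified sketch, not the kernel code: the `toy_` names and enum values are hypothetical, and the real function acts on the page directly instead of returning an action code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced stand-ins for enum migrate_reason values. */
enum toy_reason { TOY_MR_COMPACTION, TOY_MR_MEMORY_FAILURE };

/* What happens to the source page once the migration attempt is over. */
enum toy_action {
    TOY_KEEP_ON_LIST,     /* -EAGAIN: page stays on the list for a retry */
    TOY_PUT_THEN_POISON,  /* drop the reference first, then mark HWPoison */
    TOY_PUTBACK_LRU       /* hand the page back to the LRU */
};

#define TOY_MIGRATEPAGE_SUCCESS 0

static enum toy_action toy_source_page_action(int rc, enum toy_reason reason)
{
    /* A retryable failure leaves the page alone. */
    if (rc == -EAGAIN)
        return TOY_KEEP_ON_LIST;

    /*
     * Soft offline that succeeded: the page must not go through the
     * LRU cache, so it is released and poisoned right afterwards.
     */
    if (reason == TOY_MR_MEMORY_FAILURE && rc == TOY_MIGRATEPAGE_SUCCESS)
        return TOY_PUT_THEN_POISON;

    /* Every other outcome returns the page to the LRU. */
    return TOY_PUTBACK_LRU;
}
```

Keeping the poisoning step immediately after the release is what the commit message relies on to narrow the window in which the page could be reallocated.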
2015-09-09 06:03:27 +08:00
|
|
|
if (!test_set_page_hwpoison(page))
|
|
|
|
num_poisoned_pages_inc();
|
|
|
|
} else
|
2015-06-25 07:56:50 +08:00
|
|
|
putback_lru_page(page);
|
2006-06-23 17:03:51 +08:00
|
|
|
}
|
2014-06-05 07:08:25 +08:00
|
|
|
|
2006-06-23 17:03:53 +08:00
|
|
|
/*
|
2014-06-05 07:08:25 +08:00
|
|
|
* If migration was not successful and there's a freeing callback, use
|
|
|
|
* it. Otherwise, putback_lru_page() will drop the reference grabbed
|
|
|
|
* during isolation.
|
2006-06-23 17:03:53 +08:00
|
|
|
*/
|
2015-11-06 10:50:02 +08:00
|
|
|
if (put_new_page)
|
2014-06-05 07:08:25 +08:00
|
|
|
put_new_page(newpage, private);
|
2015-11-06 10:50:02 +08:00
|
|
|
else if (unlikely(__is_movable_balloon_page(newpage))) {
|
2014-10-10 06:29:27 +08:00
|
|
|
/* drop our reference, page already in the balloon */
|
|
|
|
put_page(newpage);
|
mm: fix direct reclaim writeback regression
Shortly before 3.16-rc1, Dave Jones reported:
WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
xfs_vm_writepage+0x5ce/0x630 [xfs]()
CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
Call Trace:
xfs_vm_writepage+0x5ce/0x630 [xfs]
shrink_page_list+0x8f9/0xb90
shrink_inactive_list+0x253/0x510
shrink_lruvec+0x563/0x6c0
shrink_zone+0x3b/0x100
shrink_zones+0x1f1/0x3c0
try_to_free_pages+0x164/0x380
__alloc_pages_nodemask+0x822/0xc90
alloc_pages_vma+0xaf/0x1c0
handle_mm_fault+0xa31/0xc50
etc.
970 if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
971 PF_MEMALLOC))
I did not respond at the time, because a glance at the PageDirty block
in shrink_page_list() quickly shows that this is impossible: we don't do
writeback on file pages (other than tmpfs) from direct reclaim nowadays.
Dave was hallucinating, but it would have been disrespectful to say so.
However, my own /var/log/messages now shows similar complaints
WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
from stressing some mmotm trees during July.
Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
to mapping->a_ops->writepage()?
Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
page freeing callback") has provided such a way to compaction: if
migrating a SwapBacked page fails, its newpage may be put back on the
list for later use with PageSwapBacked still set, and nothing will clear
it.
Whether that can do anything worse than issue WARN_ON_ONCEs, and get
some statistics wrong, is unclear: easier to fix than to think through
the consequences.
Fixing it here, before the put_new_page(), addresses the bug directly,
but is probably the worst place to fix it. Page migration is doing too
many parts of the job on too many levels: fixing it in
move_to_new_page() to complement its SetPageSwapBacked would be
preferable, except why is it (and newpage->mapping and newpage->index)
done there, rather than down in migrate_page_move_mapping(), once we are
sure of success? Not a cleanup to get into right now, especially not
with memcg cleanups coming in 3.17.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
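The fix described above amounts to clearing stale state on the destination page before a failed migration hands it back through the freeing callback. A minimal userspace sketch of that ordering follows. Everything in it is an assumption made for illustration: `toy_page`, `TOY_PG_SWAPBACKED`, the private free list and both functions are hypothetical and do not correspond to kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PG_SWAPBACKED 0x1u   /* hypothetical flag bit */

struct toy_page {
    unsigned int flags;
    struct toy_page *next;
};

/* Private free list that a compaction-like caller reuses later. */
static struct toy_page *toy_freelist;

/* Hypothetical destination-freeing callback: pushes the page on the list. */
static void toy_put_new_page(struct toy_page *p)
{
    p->next = toy_freelist;
    toy_freelist = p;
}

/*
 * Failure path: clear the flag that was set while preparing the
 * destination page, then return the page for later reuse. Without the
 * clear, the next user of the page would inherit a stale flag.
 */
static void toy_fail_migration(struct toy_page *newpage)
{
    newpage->flags &= ~TOY_PG_SWAPBACKED;
    toy_put_new_page(newpage);
}
```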
2014-07-27 03:58:23 +08:00
|
|
|
} else
|
2014-06-05 07:08:25 +08:00
|
|
|
putback_lru_page(newpage);
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
if (result) {
|
|
|
|
if (rc)
|
|
|
|
*result = rc;
|
|
|
|
else
|
|
|
|
*result = page_to_nid(newpage);
|
|
|
|
}
|
2006-06-23 17:03:51 +08:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
/*
|
|
|
|
* Counterpart of unmap_and_move() for hugepage migration.
|
|
|
|
*
|
|
|
|
* This function doesn't wait for the completion of hugepage I/O
|
|
|
|
* because there is no race between I/O and migration for hugepage.
|
|
|
|
* Note that currently hugepage I/O occurs only in direct I/O
|
|
|
|
* where no lock is held and PG_writeback is irrelevant,
|
|
|
|
* and writeback status of all subpages is counted in the reference
|
|
|
|
* count of the head page (i.e. if all subpages of a 2MB hugepage are
|
|
|
|
* under direct I/O, the reference of the head page is 512 and a bit more.)
|
|
|
|
* This means that when we try to migrate hugepage whose subpages are
|
|
|
|
* doing direct I/O, some references remain after try_to_unmap() and
|
|
|
|
* hugepage migration fails without data corruption.
|
|
|
|
*
|
|
|
|
* There is also no race when direct I/O is issued on the page under migration,
|
|
|
|
* because then pte is replaced with migration swap entry and direct I/O code
|
|
|
|
* will wait in the page fault for migration to complete.
|
|
|
|
*/
|
|
|
|
static int unmap_and_move_huge_page(new_page_t get_new_page,
|
2014-06-05 07:08:25 +08:00
|
|
|
free_page_t put_new_page, unsigned long private,
|
|
|
|
struct page *hpage, int force,
|
2016-03-16 05:56:18 +08:00
|
|
|
enum migrate_mode mode, int reason)
|
2010-09-08 09:19:35 +08:00
|
|
|
{
|
2015-11-06 10:49:46 +08:00
|
|
|
int rc = -EAGAIN;
|
2010-09-08 09:19:35 +08:00
|
|
|
int *result = NULL;
|
2014-12-13 08:56:19 +08:00
|
|
|
int page_was_mapped = 0;
|
2014-01-22 07:51:15 +08:00
|
|
|
struct page *new_hpage;
|
2010-09-08 09:19:35 +08:00
|
|
|
struct anon_vma *anon_vma = NULL;
|
|
|
|
|
2013-09-12 05:22:11 +08:00
|
|
|
/*
|
|
|
|
* Movability of hugepages depends on architectures and hugepage size.
|
|
|
|
* This check is necessary because some callers of hugepage migration
|
|
|
|
* like soft offline and memory hotremove don't walk through page
|
|
|
|
* tables or check whether the hugepage is pmd-based or not before
|
|
|
|
* kicking migration.
|
|
|
|
*/
|
2014-06-05 07:10:56 +08:00
|
|
|
if (!hugepage_migration_supported(page_hstate(hpage))) {
|
2014-01-22 07:51:15 +08:00
|
|
|
putback_active_hugepage(hpage);
|
2013-09-12 05:22:11 +08:00
|
|
|
return -ENOSYS;
|
2014-01-22 07:51:15 +08:00
|
|
|
}
|
2013-09-12 05:22:11 +08:00
|
|
|
|
2014-01-22 07:51:15 +08:00
|
|
|
new_hpage = get_new_page(hpage, private, &result);
|
2010-09-08 09:19:35 +08:00
|
|
|
if (!new_hpage)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (!trylock_page(hpage)) {
|
2012-01-13 09:19:43 +08:00
|
|
|
if (!force || mode != MIGRATE_SYNC)
|
2010-09-08 09:19:35 +08:00
|
|
|
goto out;
|
|
|
|
lock_page(hpage);
|
|
|
|
}
|
|
|
|
|
2011-05-25 08:12:10 +08:00
|
|
|
if (PageAnon(hpage))
|
|
|
|
anon_vma = page_get_anon_vma(hpage);
|
2010-09-08 09:19:35 +08:00
|
|
|
|
2015-11-06 10:49:49 +08:00
|
|
|
if (unlikely(!trylock_page(new_hpage)))
|
|
|
|
goto put_anon;
|
|
|
|
|
2014-12-13 08:56:19 +08:00
|
|
|
if (page_mapped(hpage)) {
|
|
|
|
try_to_unmap(hpage,
|
|
|
|
TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
|
|
|
|
page_was_mapped = 1;
|
|
|
|
}
|
2010-09-08 09:19:35 +08:00
|
|
|
|
|
|
|
if (!page_mapped(hpage))
|
2015-11-06 10:49:53 +08:00
|
|
|
rc = move_to_new_page(new_hpage, hpage, mode);
|
2010-09-08 09:19:35 +08:00
|
|
|
|
2015-11-06 10:49:53 +08:00
|
|
|
if (page_was_mapped)
|
|
|
|
remove_migration_ptes(hpage,
|
2016-03-18 05:20:07 +08:00
|
|
|
rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
|
2010-09-08 09:19:35 +08:00
|
|
|
|
2015-11-06 10:49:49 +08:00
|
|
|
unlock_page(new_hpage);
|
|
|
|
|
|
|
|
put_anon:
|
2011-01-14 07:47:31 +08:00
|
|
|
if (anon_vma)
|
2011-03-23 07:32:46 +08:00
|
|
|
put_anon_vma(anon_vma);
|
2012-08-01 07:42:27 +08:00
|
|
|
|
2015-11-06 10:49:46 +08:00
|
|
|
if (rc == MIGRATEPAGE_SUCCESS) {
|
2012-08-01 07:42:27 +08:00
|
|
|
hugetlb_cgroup_migrate(hpage, new_hpage);
|
2015-11-06 10:49:46 +08:00
|
|
|
put_new_page = NULL;
|
2016-03-16 05:56:18 +08:00
|
|
|
set_page_owner_migrate_reason(new_hpage, reason);
|
2015-11-06 10:49:46 +08:00
|
|
|
}
|
2012-08-01 07:42:27 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
unlock_page(hpage);
|
2011-12-09 06:34:20 +08:00
|
|
|
out:
|
2013-09-12 05:22:01 +08:00
|
|
|
if (rc != -EAGAIN)
|
|
|
|
putback_active_hugepage(hpage);
|
2014-06-05 07:08:25 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If migration was not successful and there's a freeing callback, use
|
|
|
|
* it. Otherwise, put_page() will drop the reference grabbed during
|
|
|
|
* isolation.
|
|
|
|
*/
|
2015-11-06 10:49:46 +08:00
|
|
|
if (put_new_page)
|
2014-06-05 07:08:25 +08:00
|
|
|
put_new_page(new_hpage, private);
|
|
|
|
else
|
2015-09-23 05:59:14 +08:00
|
|
|
putback_active_hugepage(new_hpage);
|
2014-06-05 07:08:25 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
if (result) {
|
|
|
|
if (rc)
|
|
|
|
*result = rc;
|
|
|
|
else
|
|
|
|
*result = page_to_nid(new_hpage);
|
|
|
|
}
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
2013-04-30 06:08:16 +08:00
|
|
|
* migrate_pages - migrate the pages specified in a list, to the free pages
|
|
|
|
* supplied as the target for the page migration
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
2013-04-30 06:08:16 +08:00
|
|
|
* @from: The list of pages to be migrated.
|
|
|
|
* @get_new_page: The function used to allocate free pages to be used
|
|
|
|
* as the target of the page migration.
|
2014-06-05 07:08:25 +08:00
|
|
|
* @put_new_page: The function used to free target pages if migration
|
|
|
|
* fails, or NULL if no special handling is necessary.
|
2013-04-30 06:08:16 +08:00
|
|
|
* @private: Private data to be passed on to get_new_page()
|
|
|
|
* @mode: The migration mode that specifies the constraints for
|
|
|
|
* page migration, if any.
|
|
|
|
* @reason: The reason for page migration.
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
2013-04-30 06:08:16 +08:00
|
|
|
* The function returns after 10 attempts or if no pages are movable any more
|
|
|
|
* because the list has become empty or no retryable pages exist any more.
|
2015-11-06 10:49:43 +08:00
|
|
|
* The caller should call putback_movable_pages() to return pages to the LRU
|
2011-01-26 07:07:26 +08:00
|
|
|
* or free list only if ret != 0.
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
2013-04-30 06:08:16 +08:00
|
|
|
* Returns the number of pages that were not migrated, or an error code.
|
2006-03-22 16:09:12 +08:00
|
|
|
*/
|
2013-02-23 08:35:14 +08:00
|
|
|
int migrate_pages(struct list_head *from, new_page_t get_new_page,
|
2014-06-05 07:08:25 +08:00
|
|
|
free_page_t put_new_page, unsigned long private,
|
|
|
|
enum migrate_mode mode, int reason)
|
2006-03-22 16:09:12 +08:00
|
|
|
{
|
2006-06-23 17:03:51 +08:00
|
|
|
int retry = 1;
|
2006-03-22 16:09:12 +08:00
|
|
|
int nr_failed = 0;
|
2012-10-19 17:46:20 +08:00
|
|
|
int nr_succeeded = 0;
|
2006-03-22 16:09:12 +08:00
|
|
|
int pass = 0;
|
|
|
|
struct page *page;
|
|
|
|
struct page *page2;
|
|
|
|
int swapwrite = current->flags & PF_SWAPWRITE;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (!swapwrite)
|
|
|
|
current->flags |= PF_SWAPWRITE;
|
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
for (pass = 0; pass < 10 && retry; pass++) {
|
|
|
|
retry = 0;
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
list_for_each_entry_safe(page, page2, from, lru) {
|
|
|
|
cond_resched();
|
2006-06-23 17:03:33 +08:00
|
|
|
|
mm: migrate: make core migration code aware of hugepage
Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.
The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.
This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can have benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)
Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)
As for how these are achieved, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.
My testing is mainly functional: it simply kicks hugepage migration via
each entry point and confirms that migration is done correctly. Test code
is available here:
git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.
This patch (of 9):
Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both of LRU pages and hugepages.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 05:21:59 +08:00
|
|
|
if (PageHuge(page))
|
|
|
|
rc = unmap_and_move_huge_page(get_new_page,
|
2014-06-05 07:08:25 +08:00
|
|
|
put_new_page, private, page,
|
2016-03-16 05:56:18 +08:00
|
|
|
pass > 2, mode, reason);
|
2013-09-12 05:21:59 +08:00
|
|
|
else
|
2014-06-05 07:08:25 +08:00
|
|
|
rc = unmap_and_move(get_new_page, put_new_page,
|
mm: soft-offline: don't free target page in successful page migration
Stress testing showed that soft offline events for a process iterating
"mmap-pagefault-munmap" loop can trigger
VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
Soft offlining page 0x70fe1 at 0x70100008d000
Soft offlining page 0x705fb at 0x70300008d000
page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
flags: 0x1fffff80800000(hwpoison)
page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
RIP: free_pcppages_bulk+0x52a/0x6f0
Call Trace:
drain_pages_zone+0x3d/0x50
drain_local_pages+0x1d/0x30
on_each_cpu_mask+0x46/0x80
drain_all_pages+0x14b/0x1e0
soft_offline_page+0x432/0x6e0
SyS_madvise+0x73c/0x780
system_call_fastpath+0x12/0x17
Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
RIP [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
RSP <ffff88007a117d28>
---[ end trace 53926436e76d1f35 ]---
When soft offline successfully migrates a page, the source page is supposed
to be freed. But there is a race condition where a source page looks
isolated (i.e. the refcount is 0 and PageHWPoison is set) but is
somehow still linked to a pcplist. Then another soft offline event calls
drain_all_pages() and tries to free such a hwpoisoned page, which is
forbidden.
This odd page state seems to happen due to the race between put_page() in
putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
drop lru_add_drain_all() in __soft_offline_page()", or to change page
freeing code for this soft offline's purpose.
Instead, let's think about the difference between hard offline and soft
offline. There is an interesting difference in how to isolate the in-use
page between these, that is, hard offline marks PageHWPoison of the target
page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
offline tries to free the target page then marks PageHWPoison. This
difference might be the source of complexity and result in bugs like the
above. So making soft offline isolate with keeping refcount can be a
solution for this problem.
We can pass the "reason", which identifies the caller, to the page migration
code, so let's use it to avoid calling putback_lru_page() when called from
soft offline, which effectively does the isolation for soft offline. With
this change, target pages of soft offline are never reused without changing
migratetype, so this patch also removes the related code.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 07:56:50 +08:00
|
|
|
private, page, pass > 2, mode,
|
|
|
|
reason);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
switch (rc) {
|
2006-06-23 17:03:53 +08:00
|
|
|
case -ENOMEM:
|
|
|
|
goto out;
|
2006-06-23 17:03:51 +08:00
|
|
|
case -EAGAIN:
|
2006-06-23 17:03:33 +08:00
|
|
|
retry++;
|
2006-06-23 17:03:51 +08:00
|
|
|
break;
|
2012-12-12 08:02:31 +08:00
|
|
|
case MIGRATEPAGE_SUCCESS:
|
2012-10-19 17:46:20 +08:00
|
|
|
nr_succeeded++;
|
2006-06-23 17:03:51 +08:00
|
|
|
break;
|
|
|
|
default:
|
2014-01-22 07:51:14 +08:00
|
|
|
/*
|
|
|
|
* Permanent failure (-EBUSY, -ENOSYS, etc.):
|
|
|
|
* unlike -EAGAIN case, the failed page is
|
|
|
|
* removed from migration page list and not
|
|
|
|
* retried in the next outer loop.
|
|
|
|
*/
|
2006-06-23 17:03:33 +08:00
|
|
|
nr_failed++;
|
2006-06-23 17:03:51 +08:00
|
|
|
break;
|
2006-06-23 17:03:33 +08:00
|
|
|
}
|
2006-03-22 16:09:12 +08:00
|
|
|
}
|
|
|
|
}
|
2015-11-06 10:47:03 +08:00
|
|
|
nr_failed += retry;
|
|
|
|
rc = nr_failed;
|
2006-06-23 17:03:53 +08:00
|
|
|
out:
|
2012-10-19 17:46:20 +08:00
|
|
|
if (nr_succeeded)
|
|
|
|
count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
|
|
|
|
if (nr_failed)
|
|
|
|
count_vm_events(PGMIGRATE_FAIL, nr_failed);
|
2012-10-19 21:07:31 +08:00
|
|
|
trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
if (!swapwrite)
|
|
|
|
current->flags &= ~PF_SWAPWRITE;
|
|
|
|
|
2012-12-12 08:02:31 +08:00
|
|
|
return rc;
|
2006-03-22 16:09:12 +08:00
|
|
|
}
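The retry policy of the loop above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct item`, `try_move()` and `move_all()` are hypothetical names, the work list is modelled as an array with a `done` flag instead of a kernel list, and `try_move()` merely stands in for unmap_and_move(). What it shows is the accounting: -EAGAIN results are retried for at most 10 passes, any other error is counted once and never retried, and whatever is still busy after the last pass is added to the failure count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_PASSES 10	/* same bound as the pass loop above */

/* Hypothetical work item, not a kernel structure. */
struct item {
	int busy;	/* attempts that will still return -EAGAIN */
	int broken;	/* nonzero: fails permanently with -EBUSY */
	int done;	/* set once the item has left the work list */
};

/* Hypothetical stand-in for unmap_and_move(): 0, -EAGAIN or -EBUSY. */
static int try_move(struct item *it)
{
	if (it->broken)
		return -EBUSY;
	if (it->busy > 0) {
		it->busy--;
		return -EAGAIN;
	}
	return 0;
}

/* Returns the number of items not moved; *succeeded receives the rest. */
static int move_all(struct item *items, size_t n, int *succeeded)
{
	int retry = 1, nr_failed = 0, nr_succeeded = 0;

	assert(succeeded != NULL);
	for (int pass = 0; pass < MAX_PASSES && retry; pass++) {
		retry = 0;
		for (size_t i = 0; i < n; i++) {
			if (items[i].done)
				continue;
			switch (try_move(&items[i])) {
			case -EAGAIN:
				retry++;		/* stays on the list */
				break;
			case 0:
				nr_succeeded++;
				items[i].done = 1;
				break;
			default:
				nr_failed++;		/* permanent, never retried */
				items[i].done = 1;
				break;
			}
		}
	}
	*succeeded = nr_succeeded;
	return nr_failed + retry;	/* still-busy items count as failures */
}
```

A caller would treat a nonzero return value as "some items remain on the list" and put them back itself, which is what the callers of migrate_pages() do with putback_movable_pages().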
|
2006-06-23 17:03:53 +08:00
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
/*
|
|
|
|
* Move a list of individual pages
|
|
|
|
*/
|
|
|
|
struct page_to_node {
|
|
|
|
unsigned long addr;
|
|
|
|
struct page *page;
|
|
|
|
int node;
|
|
|
|
int status;
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct page *new_page_node(struct page *p, unsigned long private,
|
|
|
|
int **result)
|
|
|
|
{
|
|
|
|
struct page_to_node *pm = (struct page_to_node *)private;
|
|
|
|
|
|
|
|
while (pm->node != MAX_NUMNODES && pm->page != p)
|
|
|
|
pm++;
|
|
|
|
|
|
|
|
if (pm->node == MAX_NUMNODES)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
*result = &pm->status;
|
|
|
|
|
2013-09-12 05:22:04 +08:00
|
|
|
if (PageHuge(p))
|
|
|
|
return alloc_huge_page_node(page_hstate(compound_head(p)),
|
|
|
|
pm->node);
|
|
|
|
else
|
mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node() that doesn't
fall back to the current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has led to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has also been considered to really provide a convenience function for
allocations restricted to a node, but the majority opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 06:03:50 +08:00
|
|
|
return __alloc_pages_node(pm->node,
|
2014-03-11 06:49:43 +08:00
|
|
|
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
|
2006-06-23 17:03:55 +08:00
|
|
|
}
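new_page_node() relies on the pm array being terminated by an entry whose node is MAX_NUMNODES. A minimal userspace sketch of that sentinel-terminated lookup follows; `struct entry`, `END_NODE` and `lookup()` are hypothetical names chosen for illustration, and `END_NODE` is an arbitrary value that only plays the role of MAX_NUMNODES.

```c
#include <assert.h>
#include <stddef.h>

#define END_NODE 1024	/* hypothetical sentinel, plays the role of MAX_NUMNODES */

/* Hypothetical table entry modelled on struct page_to_node. */
struct entry {
	const void *key;	/* plays the role of pm->page */
	int node;
	int status;
};

/* Walk the table until the key matches or the sentinel entry is reached. */
static struct entry *lookup(struct entry *e, const void *key)
{
	assert(e != NULL);
	while (e->node != END_NODE && e->key != key)
		e++;
	return e->node == END_NODE ? NULL : e;
}
```

The table carries no length, so every caller must append the sentinel entry; the walk is linear in the table size for each lookup.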
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move a set of pages as indicated in the pm array. The addr
|
|
|
|
* field must be set to the virtual address of the page to be moved
|
|
|
|
* and the node number must contain a valid target node.
|
2008-10-19 11:27:17 +08:00
|
|
|
* The pm array ends with node = MAX_NUMNODES.
|
2006-06-23 17:03:55 +08:00
|
|
|
*/
|
2008-10-19 11:27:17 +08:00
|
|
|
static int do_move_page_to_node_array(struct mm_struct *mm,
|
|
|
|
struct page_to_node *pm,
|
|
|
|
int migrate_all)
|
2006-06-23 17:03:55 +08:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct page_to_node *pp;
|
|
|
|
LIST_HEAD(pagelist);
|
|
|
|
|
|
|
|
down_read(&mm->mmap_sem);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build a list of pages to migrate
|
|
|
|
*/
|
|
|
|
for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
err = -EFAULT;
|
|
|
|
vma = find_vma(mm, pp->addr);
|
2010-10-27 05:22:07 +08:00
|
|
|
if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
|
2006-06-23 17:03:55 +08:00
|
|
|
goto set_status;
|
|
|
|
|
2015-09-05 06:47:53 +08:00
|
|
|
/* FOLL_DUMP to ignore special (like zero) pages */
|
|
|
|
page = follow_page(vma, pp->addr,
|
|
|
|
FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
|
2008-06-21 02:18:25 +08:00
|
|
|
|
|
|
|
err = PTR_ERR(page);
|
|
|
|
if (IS_ERR(page))
|
|
|
|
goto set_status;
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
err = -ENOENT;
|
|
|
|
if (!page)
|
|
|
|
goto set_status;
|
|
|
|
|
|
|
|
pp->page = page;
|
|
|
|
err = page_to_nid(page);
|
|
|
|
|
|
|
|
if (err == pp->node)
|
|
|
|
/*
|
|
|
|
* Page is already on the target node
|
|
|
|
*/
|
|
|
|
goto put_and_set;
|
|
|
|
|
|
|
|
err = -EACCES;
|
|
|
|
if (page_mapcount(page) > 1 &&
|
|
|
|
!migrate_all)
|
|
|
|
goto put_and_set;
|
|
|
|
|
2013-09-12 05:22:04 +08:00
|
|
|
if (PageHuge(page)) {
|
mm/hugetlb: take page table lock in follow_huge_pmd()
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.
This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000
int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;
pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);
while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}
$ cat hugepage.c
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;
while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}
$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 07:25:22 +08:00
|
|
|
if (PageHead(page))
|
|
|
|
isolate_huge_page(page, &pagelist);
|
2013-09-12 05:22:04 +08:00
|
|
|
goto put_and_set;
|
|
|
|
}
|
|
|
|
|
vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory pressure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are kept on the unevictable list.
This patch:
isolate_lru_page logically belongs in vmscan.c rather than migrate.c.
It is a tough call, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:09 +08:00
|
|
|
err = isolate_lru_page(page);
|
2009-12-15 09:58:11 +08:00
|
|
|
if (!err) {
|
2008-10-19 11:26:09 +08:00
|
|
|
list_add_tail(&page->lru, &pagelist);
|
2009-12-15 09:58:11 +08:00
|
|
|
inc_zone_page_state(page, NR_ISOLATED_ANON +
|
|
|
|
page_is_file_cache(page));
|
|
|
|
}
|
2006-06-23 17:03:55 +08:00
|
|
|
put_and_set:
|
|
|
|
/*
|
|
|
|
* Either remove the duplicate refcount from
|
|
|
|
* isolate_lru_page() or drop the page ref if it was
|
|
|
|
* not isolated.
|
|
|
|
*/
|
|
|
|
put_page(page);
|
|
|
|
set_status:
|
|
|
|
pp->status = err;
|
|
|
|
}
|
|
|
|
|
2008-10-19 11:27:15 +08:00
|
|
|
err = 0;
|
2010-10-27 05:21:29 +08:00
|
|
|
if (!list_empty(&pagelist)) {
|
2014-06-05 07:08:25 +08:00
|
|
|
err = migrate_pages(&pagelist, new_page_node, NULL,
|
2013-02-23 08:35:14 +08:00
|
|
|
(unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
|
2010-10-27 05:21:29 +08:00
|
|
|
if (err)
|
2013-09-12 05:22:04 +08:00
|
|
|
putback_movable_pages(&pagelist);
|
2010-10-27 05:21:29 +08:00
|
|
|
}
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
up_read(&mm->mmap_sem);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2008-10-19 11:27:17 +08:00
|
|
|
/*
|
|
|
|
* Migrate an array of page addresses onto an array of nodes and fill
|
|
|
|
* the corresponding array of status.
|
|
|
|
*/
|
2012-03-22 07:34:06 +08:00
|
|
|
static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
|
2008-10-19 11:27:17 +08:00
|
|
|
unsigned long nr_pages,
|
|
|
|
const void __user * __user *pages,
|
|
|
|
const int __user *nodes,
|
|
|
|
int __user *status, int flags)
|
|
|
|
{
|
2009-01-07 06:38:57 +08:00
|
|
|
struct page_to_node *pm;
|
|
|
|
unsigned long chunk_nr_pages;
|
|
|
|
unsigned long chunk_start;
|
|
|
|
int err;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
err = -ENOMEM;
|
|
|
|
pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
|
|
|
|
if (!pm)
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out;
|
2009-06-17 06:32:43 +08:00
|
|
|
|
|
|
|
migrate_prep();
|
|
|
|
|
2008-10-19 11:27:17 +08:00
|
|
|
/*
|
2009-01-07 06:38:57 +08:00
|
|
|
* Store a chunk of page_to_node array in a page,
|
|
|
|
* but keep the last one as a marker
|
2008-10-19 11:27:17 +08:00
|
|
|
*/
|
2009-01-07 06:38:57 +08:00
|
|
|
chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
for (chunk_start = 0;
|
|
|
|
chunk_start < nr_pages;
|
|
|
|
chunk_start += chunk_nr_pages) {
|
|
|
|
int j;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
if (chunk_start + chunk_nr_pages > nr_pages)
|
|
|
|
chunk_nr_pages = nr_pages - chunk_start;
|
|
|
|
|
|
|
|
/* fill the chunk pm with addrs and nodes from user-space */
|
|
|
|
for (j = 0; j < chunk_nr_pages; j++) {
|
|
|
|
const void __user *p;
|
2008-10-19 11:27:17 +08:00
|
|
|
int node;
|
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
err = -EFAULT;
|
|
|
|
if (get_user(p, pages + j + chunk_start))
|
|
|
|
goto out_pm;
|
|
|
|
pm[j].addr = (unsigned long) p;
|
|
|
|
|
|
|
|
if (get_user(node, nodes + j + chunk_start))
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out_pm;
|
|
|
|
|
|
|
|
err = -ENODEV;
|
2010-02-06 08:16:50 +08:00
|
|
|
if (node < 0 || node >= MAX_NUMNODES)
|
|
|
|
goto out_pm;
|
|
|
|
|
2012-12-13 05:51:30 +08:00
|
|
|
if (!node_state(node, N_MEMORY))
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out_pm;
|
|
|
|
|
|
|
|
err = -EACCES;
|
|
|
|
if (!node_isset(node, task_nodes))
|
|
|
|
goto out_pm;
|
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
pm[j].node = node;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* End marker for this chunk */
|
|
|
|
pm[chunk_nr_pages].node = MAX_NUMNODES;
|
|
|
|
|
|
|
|
/* Migrate this chunk */
|
|
|
|
err = do_move_page_to_node_array(mm, pm,
|
|
|
|
flags & MPOL_MF_MOVE_ALL);
|
|
|
|
if (err < 0)
|
|
|
|
goto out_pm;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
|
|
|
/* Return status information */
|
2009-01-07 06:38:57 +08:00
|
|
|
for (j = 0; j < chunk_nr_pages; j++)
|
|
|
|
if (put_user(pm[j].status, status + j + chunk_start)) {
|
2008-10-19 11:27:17 +08:00
|
|
|
err = -EFAULT;
|
2009-01-07 06:38:57 +08:00
|
|
|
goto out_pm;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
err = 0;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
|
|
|
out_pm:
|
2009-01-07 06:38:57 +08:00
|
|
|
free_page((unsigned long)pm);
|
2008-10-19 11:27:17 +08:00
|
|
|
out:
|
|
|
|
return err;
|
|
|
|
}
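The chunk arithmetic in do_pages_move() can be checked in isolation. The sketch below is an assumption-laden illustration, not kernel code: `count_chunks()` is a hypothetical helper that reproduces only the sizing logic, namely a capacity of buffer size divided by entry size minus one slot for the end marker, with a shorter final chunk.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: split 'total' entries into chunks that fit a buffer
 * of 'buf_bytes', keeping the last slot free for an end marker.
 * Returns the number of chunks; *last_len receives the final chunk length.
 */
static size_t count_chunks(size_t total, size_t buf_bytes,
			   size_t entry_bytes, size_t *last_len)
{
	size_t cap, start, chunks = 0;

	assert(last_len != NULL);
	assert(entry_bytes > 0 && buf_bytes / entry_bytes >= 2);

	cap = buf_bytes / entry_bytes - 1;	/* one slot reserved for the marker */
	*last_len = 0;
	for (start = 0; start < total; start += cap) {
		size_t len = cap;

		if (start + len > total)
			len = total - start;	/* short final chunk */
		*last_len = len;
		chunks++;
	}
	return chunks;
}
```

Assuming a 4096-byte page and a 24-byte entry (the size struct page_to_node would have with 8-byte longs and pointers), each chunk holds 169 entries, so 1000 addresses need six chunks with 155 entries in the last one.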
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
/*
|
2008-10-19 11:27:16 +08:00
|
|
|
* Determine the node of each page in an array and store it in a status array.
|
2006-06-23 17:03:55 +08:00
|
|
|
*/
|
2008-12-10 05:14:23 +08:00
|
|
|
static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
|
|
|
|
const void __user **pages, int *status)
|
2006-06-23 17:03:55 +08:00
|
|
|
{
|
2008-10-19 11:27:16 +08:00
|
|
|
unsigned long i;
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
down_read(&mm->mmap_sem);
|
|
|
|
|
2008-10-19 11:27:16 +08:00
|
|
|
for (i = 0; i < nr_pages; i++) {
|
2008-12-10 05:14:23 +08:00
|
|
|
unsigned long addr = (unsigned long)(*pages);
|
2006-06-23 17:03:55 +08:00
|
|
|
struct vm_area_struct *vma;
|
|
|
|
struct page *page;
|
2008-12-16 15:06:43 +08:00
|
|
|
int err = -EFAULT;
|
2008-10-19 11:27:16 +08:00
|
|
|
|
|
|
|
vma = find_vma(mm, addr);
|
2010-10-27 05:22:07 +08:00
|
|
|
if (!vma || addr < vma->vm_start)
|
2006-06-23 17:03:55 +08:00
|
|
|
goto set_status;
|
|
|
|
|
2015-09-05 06:47:53 +08:00
|
|
|
/* FOLL_DUMP to ignore special (like zero) pages */
|
|
|
|
page = follow_page(vma, addr, FOLL_DUMP);
|
2008-06-21 02:18:25 +08:00
|
|
|
|
|
|
|
err = PTR_ERR(page);
|
|
|
|
if (IS_ERR(page))
|
|
|
|
goto set_status;
|
|
|
|
|
2015-09-05 06:47:53 +08:00
|
|
|
err = page ? page_to_nid(page) : -ENOENT;
|
2006-06-23 17:03:55 +08:00
|
|
|
set_status:
|
2008-12-10 05:14:23 +08:00
|
|
|
*status = err;
|
|
|
|
|
|
|
|
pages++;
|
|
|
|
status++;
|
|
|
|
}
|
|
|
|
|
|
|
|
up_read(&mm->mmap_sem);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine the node of each page in a user array and store it in
|
|
|
|
* a user status array.
|
|
|
|
*/
|
|
|
|
static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
|
|
|
|
const void __user * __user *pages,
|
|
|
|
int __user *status)
|
|
|
|
{
|
|
|
|
#define DO_PAGES_STAT_CHUNK_NR 16
|
|
|
|
const void __user *chunk_pages[DO_PAGES_STAT_CHUNK_NR];
|
|
|
|
int chunk_status[DO_PAGES_STAT_CHUNK_NR];
|
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
while (nr_pages) {
|
|
|
|
unsigned long chunk_nr;
|
2008-12-10 05:14:23 +08:00
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
chunk_nr = nr_pages;
|
|
|
|
if (chunk_nr > DO_PAGES_STAT_CHUNK_NR)
|
|
|
|
chunk_nr = DO_PAGES_STAT_CHUNK_NR;
|
|
|
|
|
|
|
|
if (copy_from_user(chunk_pages, pages, chunk_nr * sizeof(*chunk_pages)))
|
|
|
|
break;
|
2008-12-10 05:14:23 +08:00
|
|
|
|
|
|
|
do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
|
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
|
|
|
|
break;
|
2006-06-23 17:03:55 +08:00
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
pages += chunk_nr;
|
|
|
|
status += chunk_nr;
|
|
|
|
nr_pages -= chunk_nr;
|
|
|
|
}
|
|
|
|
return nr_pages ? -EFAULT : 0;
|
2006-06-23 17:03:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move a list of pages in the address space of the process identified
|
|
|
|
* by pid (the calling process when pid is 0).
|
|
|
|
*/
|
2009-01-14 21:14:30 +08:00
|
|
|
SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
|
|
|
|
const void __user * __user *, pages,
|
|
|
|
const int __user *, nodes,
|
|
|
|
int __user *, status, int, flags)
|
2006-06-23 17:03:55 +08:00
|
|
|
{
|
2008-11-14 07:39:19 +08:00
|
|
|
const struct cred *cred = current_cred(), *tcred;
|
2006-06-23 17:03:55 +08:00
|
|
|
struct task_struct *task;
|
|
|
|
struct mm_struct *mm;
|
2008-10-19 11:27:17 +08:00
|
|
|
int err;
|
2012-03-22 07:34:06 +08:00
|
|
|
nodemask_t task_nodes;
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
/* Check flags */
|
|
|
|
if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
/* Find the mm_struct */
|
2011-02-26 06:44:13 +08:00
|
|
|
rcu_read_lock();
|
2007-10-19 14:40:16 +08:00
|
|
|
task = pid ? find_task_by_vpid(pid) : current;
|
2006-06-23 17:03:55 +08:00
|
|
|
if (!task) {
|
2011-02-26 06:44:13 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
return -ESRCH;
|
|
|
|
}
|
2012-03-22 07:34:06 +08:00
|
|
|
get_task_struct(task);
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if this process has the right to modify the specified
|
|
|
|
* process. The right exists if the process has administrative
|
|
|
|
* capabilities, superuser privileges or the same
|
|
|
|
* userid as the target process.
|
|
|
|
*/
|
2008-11-14 07:39:19 +08:00
|
|
|
tcred = __task_cred(task);
|
2012-03-13 06:48:24 +08:00
|
|
|
if (!uid_eq(cred->euid, tcred->suid) && !uid_eq(cred->euid, tcred->uid) &&
|
|
|
|
!uid_eq(cred->uid, tcred->suid) && !uid_eq(cred->uid, tcred->uid) &&
|
2006-06-23 17:03:55 +08:00
|
|
|
!capable(CAP_SYS_NICE)) {
|
2008-11-14 07:39:19 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
err = -EPERM;
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out;
|
2006-06-23 17:03:55 +08:00
|
|
|
}
|
2008-11-14 07:39:19 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
|
2006-06-23 17:04:02 +08:00
|
|
|
err = security_task_movememory(task);
|
|
|
|
if (err)
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out;
|
2006-06-23 17:04:02 +08:00
|
|
|
|
2012-03-22 07:34:06 +08:00
|
|
|
task_nodes = cpuset_mems_allowed(task);
|
|
|
|
mm = get_task_mm(task);
|
|
|
|
put_task_struct(task);
|
|
|
|
|
2012-04-26 07:01:53 +08:00
|
|
|
if (!mm)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (nodes)
|
|
|
|
err = do_pages_move(mm, task_nodes, nr_pages, pages,
|
|
|
|
nodes, status, flags);
|
|
|
|
else
|
|
|
|
err = do_pages_stat(mm, nr_pages, pages, status);
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
mmput(mm);
|
|
|
|
return err;
|
2012-03-22 07:34:06 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
put_task_struct(task);
|
|
|
|
return err;
|
2006-06-23 17:03:55 +08:00
|
|
|
}
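The `nodes == NULL` branch above turns `move_pages` into a pure query: the kernel fills `status` with the node that currently backs each page, or a negative errno for that page. A minimal userspace sketch of that query path follows. It assumes a Linux system whose libc headers define `SYS_move_pages`; the helper name `query_page_node` is hypothetical and not part of this file.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask which NUMA node backs the page containing addr.
 * Returns the node id (>= 0), a negative per-page errno reported through
 * the status array, or -errno if the system call itself failed.
 */
static int query_page_node(void *addr)
{
	uintptr_t page_mask = ~((uintptr_t)sysconf(_SC_PAGESIZE) - 1);
	void *pages[1];
	int status[1] = { 0 };
	long rc;

	/* Pass a page-aligned address, one entry only. */
	pages[0] = (void *)((uintptr_t)addr & page_mask);

	/*
	 * pid 0 selects the calling process; nodes == NULL requests a status
	 * query instead of a move; flags == 0 passes the flag check.
	 */
	rc = syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0);
	if (rc < 0)
		return -errno;

	return status[0];
}
```

The page should be touched before the query, since a page that has not been faulted in is reported through `status` as a negative errno rather than a node id.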
|
|
|
|
|
2012-10-25 20:16:34 +08:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
|
|
|
/*
|
|
|
|
* Returns true if this is a safe migration target node for misplaced NUMA
|
|
|
|
* pages. Currently it only checks the watermarks, which is crude.
|
|
|
|
*/
|
|
|
|
static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
|
2013-02-23 08:34:27 +08:00
|
|
|
unsigned long nr_migrate_pages)
|
2012-10-25 20:16:34 +08:00
|
|
|
{
|
|
|
|
int z;
|
|
|
|
for (z = pgdat->nr_zones - 1; z >= 0; z--) {
|
|
|
|
struct zone *zone = pgdat->node_zones + z;
|
|
|
|
|
|
|
|
if (!populated_zone(zone))
|
|
|
|
continue;
|
|
|
|
|
2013-09-12 05:22:36 +08:00
|
|
|
if (!zone_reclaimable(zone))
|
2012-10-25 20:16:34 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Avoid waking kswapd by allocating nr_migrate_pages pages. */
|
|
|
|
if (!zone_watermark_ok(zone, 0,
|
|
|
|
high_wmark_pages(zone) +
|
|
|
|
nr_migrate_pages,
|
|
|
|
0, 0))
|
|
|
|
continue;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct page *alloc_misplaced_dst_page(struct page *page,
|
|
|
|
unsigned long data,
|
|
|
|
int **result)
|
|
|
|
{
|
|
|
|
int nid = (int) data;
|
|
|
|
struct page *newpage;
|
|
|
|
|
mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fall back to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has led to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 06:03:50 +08:00
|
|
|
newpage = __alloc_pages_node(nid,
|
2014-03-11 06:49:43 +08:00
|
|
|
(GFP_HIGHUSER_MOVABLE |
|
|
|
|
__GFP_THISNODE | __GFP_NOMEMALLOC |
|
|
|
|
__GFP_NORETRY | __GFP_NOWARN) &
|
2016-02-27 07:19:31 +08:00
|
|
|
~__GFP_RECLAIM, 0);
|
2012-11-27 22:46:24 +08:00
|
|
|
|
2012-10-25 20:16:34 +08:00
|
|
|
return newpage;
|
|
|
|
}
|
|
|
|
|
2012-11-15 05:41:46 +08:00
|
|
|
/*
|
|
|
|
* page migration rate limiting control.
|
|
|
|
* Do not migrate more than @ratelimit_pages in a @migrate_interval_millisecs
|
|
|
|
* window of time. Default here says do not migrate more than 1280M per second.
|
|
|
|
*/
|
|
|
|
static unsigned int migrate_interval_millisecs __read_mostly = 100;
|
|
|
|
static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
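The "1280M per second" figure in the comment follows from the two defaults: 128 MiB worth of pages per 100 ms window, ten windows per second. The sketch below reproduces that arithmetic in userspace. It assumes 4 KiB pages (`PAGE_SHIFT` of 12), which is an assumption for illustration only; all `MODEL_`/`model_` names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT   12    /* assumption: 4 KiB pages */
#define MODEL_INTERVAL_MS  100u  /* mirrors migrate_interval_millisecs */

/* Pages allowed per window: 128 MiB expressed in pages. */
static uint64_t model_ratelimit_pages(void)
{
	return (uint64_t)128 << (20 - MODEL_PAGE_SHIFT);
}

/* Resulting ceiling in MiB per second. */
static uint64_t model_mib_per_second(void)
{
	uint64_t bytes_per_window = model_ratelimit_pages() << MODEL_PAGE_SHIFT;
	uint64_t windows_per_second = 1000u / MODEL_INTERVAL_MS;

	return (bytes_per_window * windows_per_second) >> 20;
}
```

With these assumptions the per-window budget is 32768 pages and the ceiling is 1280 MiB/s. A different page size changes the page count but not the byte budget.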
|
|
|
|
|
2012-11-19 20:35:47 +08:00
|
|
|
/* Returns true if the node is migrate rate-limited after the update */
|
2014-01-22 07:50:58 +08:00
|
|
|
static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
|
|
|
|
unsigned long nr_pages)
|
2012-10-25 20:16:34 +08:00
|
|
|
{
|
2012-11-15 05:41:46 +08:00
|
|
|
/*
|
|
|
|
* Rate-limit the amount of data that is being migrated to a node.
|
|
|
|
* Optimal placement is no good if the memory bus is saturated and
|
|
|
|
* all the time is being spent migrating!
|
|
|
|
*/
|
|
|
|
if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
|
2014-01-22 07:50:59 +08:00
|
|
|
spin_lock(&pgdat->numabalancing_migrate_lock);
|
2012-11-15 05:41:46 +08:00
|
|
|
pgdat->numabalancing_migrate_nr_pages = 0;
|
|
|
|
pgdat->numabalancing_migrate_next_window = jiffies +
|
|
|
|
msecs_to_jiffies(migrate_interval_millisecs);
|
2014-01-22 07:50:59 +08:00
|
|
|
spin_unlock(&pgdat->numabalancing_migrate_lock);
|
2012-11-15 05:41:46 +08:00
|
|
|
}
|
2014-01-22 07:51:01 +08:00
|
|
|
if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
|
|
|
|
trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
|
|
|
|
nr_pages);
|
2014-01-22 07:50:59 +08:00
|
|
|
return true;
|
2014-01-22 07:51:01 +08:00
|
|
|
}
|
2014-01-22 07:50:59 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is an unlocked non-atomic update so errors are possible.
|
|
|
|
* The consequences are failing to migrate when we potentially should
|
|
|
|
* have which is not severe enough to warrant locking. If it is ever
|
|
|
|
* a problem, it can be converted to a per-cpu counter.
|
|
|
|
*/
|
|
|
|
pgdat->numabalancing_migrate_nr_pages += nr_pages;
|
|
|
|
return false;
|
2012-11-19 20:35:47 +08:00
|
|
|
}
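The window logic in `numamigrate_update_ratelimit()` can be modelled outside the kernel to make its behaviour easier to see. The sketch below is a single-threaded approximation with no locking: it replaces `jiffies` with an explicit `now` argument and uses a wraparound-safe comparison in the spirit of `time_after()`. The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-node rate-limit fields in pg_data_t. */
struct ratelimit_model {
	unsigned long next_window;  /* tick at which the current window ends */
	unsigned long nr_pages;     /* pages accounted in the current window */
	unsigned long window_ticks; /* window length in ticks */
	unsigned long limit_pages;  /* per-window page budget */
};

/* True if tick a is later than tick b, tolerating counter wraparound. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Returns true if the caller is rate-limited and should not migrate. */
static bool ratelimit_update(struct ratelimit_model *rl, unsigned long now,
			     unsigned long nr_pages)
{
	/* Start a fresh window once the previous one has expired. */
	if (tick_after(now, rl->next_window)) {
		rl->nr_pages = 0;
		rl->next_window = now + rl->window_ticks;
	}

	/* The budget is checked before the new pages are accounted. */
	if (rl->nr_pages > rl->limit_pages)
		return true;

	rl->nr_pages += nr_pages;
	return false;
}
```

Because the check happens before the addition, a window can overshoot its budget by up to one request. That matches the order of operations in the function above, where a huge-page request of `HPAGE_PMD_NR` pages is accepted as long as the counter had not yet passed the limit.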
|
|
|
|
|
2014-01-22 07:50:58 +08:00
|
|
|
static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
|
2012-11-19 20:35:47 +08:00
|
|
|
{
|
2013-02-23 08:34:33 +08:00
|
|
|
int page_lru;
|
2012-11-15 05:41:46 +08:00
|
|
|
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
|
2013-02-23 08:34:27 +08:00
|
|
|
|
2012-10-25 20:16:34 +08:00
|
|
|
/* Avoid migrating to a node that is nearly full */
|
2013-02-23 08:34:33 +08:00
|
|
|
if (!migrate_balanced_pgdat(pgdat, 1UL << compound_order(page)))
|
|
|
|
return 0;
|
2012-10-25 20:16:34 +08:00
|
|
|
|
2013-02-23 08:34:33 +08:00
|
|
|
if (isolate_lru_page(page))
|
|
|
|
return 0;
|
2012-10-25 20:16:34 +08:00
|
|
|
|
2013-02-23 08:34:33 +08:00
|
|
|
/*
|
|
|
|
* migrate_misplaced_transhuge_page() skips page migration's usual
|
|
|
|
* check on page_count(), so we must do it here, now that the page
|
|
|
|
* has been isolated: a GUP pin, or any other pin, prevents migration.
|
|
|
|
* The expected page count is 3: 1 for page's mapcount, 1 for the
|
|
|
|
* caller's pin and 1 for the reference taken by isolate_lru_page().
|
|
|
|
*/
|
|
|
|
if (PageTransHuge(page) && page_count(page) != 3) {
|
|
|
|
putback_lru_page(page);
|
|
|
|
return 0;
|
2012-10-25 20:16:34 +08:00
|
|
|
}
|
|
|
|
|
2013-02-23 08:34:33 +08:00
|
|
|
page_lru = page_is_file_cache(page);
|
|
|
|
mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
|
|
|
|
hpage_nr_pages(page));
|
|
|
|
|
2012-11-27 22:03:05 +08:00
|
|
|
/*
|
2013-02-23 08:34:33 +08:00
|
|
|
* Isolating the page has taken another reference, so the
|
|
|
|
* caller's reference can be safely dropped without the page
|
|
|
|
* disappearing underneath us during migration.
|
2012-11-27 22:03:05 +08:00
|
|
|
*/
|
|
|
|
put_page(page);
|
2013-02-23 08:34:33 +08:00
|
|
|
return 1;
|
2012-11-19 20:35:47 +08:00
|
|
|
}
|
|
|
|
|
2013-12-19 09:08:42 +08:00
|
|
|
bool pmd_trans_migrating(pmd_t pmd)
|
|
|
|
{
|
|
|
|
struct page *page = pmd_page(pmd);
|
|
|
|
return PageLocked(page);
|
|
|
|
}
|
|
|
|
|
2012-11-19 20:35:47 +08:00
|
|
|
/*
|
|
|
|
* Attempt to migrate a misplaced page to the specified destination
|
|
|
|
* node. Caller is expected to have an elevated reference count on
|
|
|
|
* the page that will be dropped by this function before returning.
|
|
|
|
*/
|
2013-10-07 18:29:05 +08:00
|
|
|
int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
|
|
|
|
int node)
|
2012-11-19 20:35:47 +08:00
|
|
|
{
|
|
|
|
pg_data_t *pgdat = NODE_DATA(node);
|
2013-02-23 08:34:33 +08:00
|
|
|
int isolated;
|
2012-11-19 20:35:47 +08:00
|
|
|
int nr_remaining;
|
|
|
|
LIST_HEAD(migratepages);
|
|
|
|
|
|
|
|
/*
|
2013-10-07 18:29:05 +08:00
|
|
|
* Don't migrate file pages that are mapped in multiple processes
|
|
|
|
* with execute permissions as they are probably shared libraries.
|
2012-11-19 20:35:47 +08:00
|
|
|
*/
|
2013-10-07 18:29:05 +08:00
|
|
|
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
|
|
|
|
(vma->vm_flags & VM_EXEC))
|
2012-11-19 20:35:47 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Rate-limit the amount of data that is being migrated to a node.
|
|
|
|
* Optimal placement is no good if the memory bus is saturated and
|
|
|
|
* all the time is being spent migrating!
|
|
|
|
*/
|
2013-02-23 08:34:33 +08:00
|
|
|
if (numamigrate_update_ratelimit(pgdat, 1))
|
2012-11-19 20:35:47 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
isolated = numamigrate_isolate_page(pgdat, page);
|
|
|
|
if (!isolated)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
list_add(&page->lru, &migratepages);
|
2013-02-23 08:35:14 +08:00
|
|
|
nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
|
2014-06-05 07:08:25 +08:00
|
|
|
NULL, node, MIGRATE_ASYNC,
|
|
|
|
MR_NUMA_MISPLACED);
|
2012-11-19 20:35:47 +08:00
|
|
|
if (nr_remaining) {
|
2014-01-22 07:51:17 +08:00
|
|
|
if (!list_empty(&migratepages)) {
|
|
|
|
list_del(&page->lru);
|
|
|
|
dec_zone_page_state(page, NR_ISOLATED_ANON +
|
|
|
|
page_is_file_cache(page));
|
|
|
|
putback_lru_page(page);
|
|
|
|
}
|
2012-11-19 20:35:47 +08:00
|
|
|
isolated = 0;
|
|
|
|
} else
|
|
|
|
count_vm_numa_event(NUMA_PAGE_MIGRATE);
|
2012-10-25 20:16:34 +08:00
|
|
|
BUG_ON(!list_empty(&migratepages));
|
|
|
|
return isolated;
|
2013-02-23 08:34:33 +08:00
|
|
|
|
|
|
|
out:
|
|
|
|
put_page(page);
|
|
|
|
return 0;
|
2012-10-25 20:16:34 +08:00
|
|
|
}
|
2012-12-05 17:32:56 +08:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
|
2012-11-19 20:35:47 +08:00
|
|
|
|
2012-12-05 17:32:56 +08:00
|
|
|
#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
|
2013-02-23 08:34:33 +08:00
|
|
|
/*
|
|
|
|
* Migrates a THP to a given target node. page must be locked and is unlocked
|
|
|
|
* before returning.
|
|
|
|
*/
|
2012-11-19 20:35:47 +08:00
|
|
|
int migrate_misplaced_transhuge_page(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
pmd_t *pmd, pmd_t entry,
|
|
|
|
unsigned long address,
|
|
|
|
struct page *page, int node)
|
|
|
|
{
|
2013-11-15 06:31:04 +08:00
|
|
|
spinlock_t *ptl;
|
2012-11-19 20:35:47 +08:00
|
|
|
pg_data_t *pgdat = NODE_DATA(node);
|
|
|
|
int isolated = 0;
|
|
|
|
struct page *new_page = NULL;
|
|
|
|
int page_lru = page_is_file_cache(page);
|
2013-12-19 09:08:33 +08:00
|
|
|
unsigned long mmun_start = address & HPAGE_PMD_MASK;
|
|
|
|
unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
|
2013-12-19 09:08:32 +08:00
|
|
|
pmd_t orig_entry;
|
2012-11-19 20:35:47 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Rate-limit the amount of data that is being migrated to a node.
|
|
|
|
* Optimal placement is no good if the memory bus is saturated and
|
|
|
|
* all the time is being spent migrating!
|
|
|
|
*/
|
2012-11-29 17:24:36 +08:00
|
|
|
if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
|
2012-11-19 20:35:47 +08:00
|
|
|
goto out_dropref;
|
|
|
|
|
|
|
|
new_page = alloc_pages_node(node,
|
2015-11-07 08:28:28 +08:00
|
|
|
(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
|
2014-03-11 06:49:43 +08:00
|
|
|
HPAGE_PMD_ORDER);
|
2013-02-23 08:34:33 +08:00
|
|
|
if (!new_page)
|
|
|
|
goto out_fail;
|
2016-01-16 08:54:17 +08:00
|
|
|
prep_transhuge_page(new_page);
|
2013-02-23 08:34:33 +08:00
|
|
|
|
2012-11-19 20:35:47 +08:00
|
|
|
isolated = numamigrate_isolate_page(pgdat, page);
|
2013-02-23 08:34:33 +08:00
|
|
|
if (!isolated) {
|
2012-11-19 20:35:47 +08:00
|
|
|
put_page(new_page);
|
2013-02-23 08:34:33 +08:00
|
|
|
goto out_fail;
|
2012-11-19 20:35:47 +08:00
|
|
|
}
|
2016-03-18 05:18:56 +08:00
|
|
|
/*
|
|
|
|
* We are not sure a pending tlb flush here is for a huge page
|
|
|
|
* mapping or not. Hence use the tlb range variant
|
|
|
|
*/
|
2013-12-19 09:08:46 +08:00
|
|
|
if (mm_tlb_flush_pending(mm))
|
|
|
|
flush_tlb_range(vma, mmun_start, mmun_end);
|
|
|
|
|
2012-11-19 20:35:47 +08:00
|
|
|
/* Prepare a page as a migration target */
|
2016-01-16 08:51:24 +08:00
|
|
|
__SetPageLocked(new_page);
|
2012-11-19 20:35:47 +08:00
|
|
|
SetPageSwapBacked(new_page);
|
|
|
|
|
|
|
|
/* anon mapping, we can simply copy page->mapping to the new page: */
|
|
|
|
new_page->mapping = page->mapping;
|
|
|
|
new_page->index = page->index;
|
|
|
|
migrate_page_copy(new_page, page);
|
|
|
|
WARN_ON(PageLRU(new_page));
|
|
|
|
|
|
|
|
/* Recheck the target PMD */
|
2013-12-19 09:08:33 +08:00
|
|
|
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
|
2013-11-15 06:31:04 +08:00
|
|
|
ptl = pmd_lock(mm, pmd);
|
2013-12-19 09:08:32 +08:00
|
|
|
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
|
|
|
|
fail_putback:
|
2013-11-15 06:31:04 +08:00
|
|
|
spin_unlock(ptl);
|
2013-12-19 09:08:33 +08:00
|
|
|
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
|
2012-11-19 20:35:47 +08:00
|
|
|
|
|
|
|
/* Reverse changes made by migrate_page_copy() */
|
|
|
|
if (TestClearPageActive(new_page))
|
|
|
|
SetPageActive(page);
|
|
|
|
if (TestClearPageUnevictable(new_page))
|
|
|
|
SetPageUnevictable(page);
|
|
|
|
|
|
|
|
unlock_page(new_page);
|
|
|
|
put_page(new_page); /* Free it */
|
|
|
|
|
2013-10-07 18:28:46 +08:00
|
|
|
/* Retake the caller's reference and putback on LRU */
|
|
|
|
get_page(page);
|
2012-11-19 20:35:47 +08:00
|
|
|
putback_lru_page(page);
|
2013-10-07 18:28:46 +08:00
|
|
|
mod_zone_page_state(page_zone(page),
|
|
|
|
NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
|
2013-12-19 09:08:39 +08:00
|
|
|
|
|
|
|
goto out_unlock;
|
2012-11-19 20:35:47 +08:00
|
|
|
}
|
|
|
|
|
2013-12-19 09:08:32 +08:00
|
|
|
orig_entry = *pmd;
|
2012-11-19 20:35:47 +08:00
|
|
|
entry = mk_pmd(new_page, vma->vm_page_prot);
|
|
|
|
entry = pmd_mkhuge(entry);
|
2013-12-19 09:08:32 +08:00
|
|
|
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
|
2012-11-19 20:35:47 +08:00
|
|
|
|
2013-12-19 09:08:32 +08:00
|
|
|
/*
|
|
|
|
* Clear the old entry under pagetable lock and establish the new PMD.
|
|
|
|
* Any parallel GUP will either observe the old page blocking on the
|
|
|
|
* page lock, block on the page table lock or observe the new page.
|
|
|
|
* The SetPageUptodate on the new page and page_add_new_anon_rmap
|
|
|
|
* guarantee the copy is visible before the pagetable update.
|
|
|
|
*/
|
2013-12-19 09:08:33 +08:00
|
|
|
flush_cache_range(vma, mmun_start, mmun_end);
|
2016-01-16 08:52:16 +08:00
|
|
|
page_add_anon_rmap(new_page, vma, mmun_start, true);
|
2015-06-25 07:57:44 +08:00
|
|
|
pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
|
2013-12-19 09:08:33 +08:00
|
|
|
set_pmd_at(mm, mmun_start, pmd, entry);
|
2012-12-10 16:50:57 +08:00
|
|
|
update_mmu_cache_pmd(vma, address, &entry);
|
2013-12-19 09:08:32 +08:00
|
|
|
|
|
|
|
if (page_count(page) != 2) {
|
2013-12-19 09:08:33 +08:00
|
|
|
set_pmd_at(mm, mmun_start, pmd, orig_entry);
|
2016-03-18 05:18:56 +08:00
|
|
|
flush_pmd_tlb_range(vma, mmun_start, mmun_end);
|
2014-11-13 10:46:09 +08:00
|
|
|
mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
|
2013-12-19 09:08:32 +08:00
|
|
|
update_mmu_cache_pmd(vma, address, &entry);
|
2016-01-16 08:52:16 +08:00
|
|
|
page_remove_rmap(new_page, true);
|
2013-12-19 09:08:32 +08:00
|
|
|
goto fail_putback;
|
|
|
|
}
|
|
|
|
|
2015-11-06 10:49:37 +08:00
|
|
|
mlock_migrate_page(new_page, page);
|
2016-01-16 08:52:16 +08:00
|
|
|
page_remove_rmap(page, true);
|
2016-03-16 05:56:18 +08:00
|
|
|
set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
|
2013-12-19 09:08:32 +08:00
|
|
|
|
2013-11-15 06:31:04 +08:00
|
|
|
spin_unlock(ptl);
|
2013-12-19 09:08:33 +08:00
|
|
|
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
|
2012-11-19 20:35:47 +08:00
|
|
|
|
2014-06-05 07:07:41 +08:00
|
|
|
/* Take an "isolate" reference and put new page on the LRU. */
|
|
|
|
get_page(new_page);
|
|
|
|
putback_lru_page(new_page);
|
|
|
|
|
2012-11-19 20:35:47 +08:00
|
|
|
unlock_page(new_page);
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page); /* Drop the rmap reference */
|
|
|
|
put_page(page); /* Drop the LRU isolation reference */
|
|
|
|
|
|
|
|
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
|
|
|
|
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
|
|
|
|
|
|
|
|
mod_zone_page_state(page_zone(page),
|
|
|
|
NR_ISOLATED_ANON + page_lru,
|
|
|
|
-HPAGE_PMD_NR);
|
|
|
|
return isolated;
|
|
|
|
|
2013-02-23 08:34:33 +08:00
|
|
|
out_fail:
|
|
|
|
count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
|
2012-11-19 20:35:47 +08:00
|
|
|
out_dropref:
|
2013-12-19 09:08:32 +08:00
|
|
|
ptl = pmd_lock(mm, pmd);
|
|
|
|
if (pmd_same(*pmd, entry)) {
|
2015-02-13 06:58:28 +08:00
|
|
|
entry = pmd_modify(entry, vma->vm_page_prot);
|
2013-12-19 09:08:33 +08:00
|
|
|
set_pmd_at(mm, mmun_start, pmd, entry);
|
2013-12-19 09:08:32 +08:00
|
|
|
update_mmu_cache_pmd(vma, address, &entry);
|
|
|
|
}
|
|
|
|
spin_unlock(ptl);
|
2013-10-07 18:28:46 +08:00
|
|
|
|
2013-12-19 09:08:39 +08:00
|
|
|
out_unlock:
|
2013-02-23 08:34:33 +08:00
|
|
|
unlock_page(page);
|
2012-11-19 20:35:47 +08:00
|
|
|
put_page(page);
|
|
|
|
return 0;
|
|
|
|
}
|
2012-10-25 20:16:34 +08:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
|
|
|
|
|
|
|
|
#endif /* CONFIG_NUMA */
|