When I tested my new patches, I found that page pointer which is used
for setting page_owner information is changed. This is because page
pointer is used to set new migratetype in loop. After this work, page
pointer could be out of bound. If this wrong pointer is used for
page_owner, access violation happens. Below is error message that I
got.
BUG: unable to handle kernel paging request at 0000000000b00018
IP: [<ffffffff81025f30>] save_stack_address+0x30/0x40
PGD 1af2d067 PUD 166e0067 PMD 0
Oops: 0002 [#1] SMP
...snip...
Call Trace:
print_context_stack+0xcf/0x100
dump_trace+0x15f/0x320
save_stack_trace+0x2f/0x50
__set_page_owner+0x46/0x70
__isolate_free_page+0x1f7/0x210
split_free_page+0x21/0xb0
isolate_freepages_block+0x1e2/0x410
compaction_alloc+0x22d/0x2d0
migrate_pages+0x289/0x8b0
compact_zone+0x409/0x880
compact_zone_order+0x6d/0x90
try_to_compact_pages+0x110/0x210
__alloc_pages_direct_compact+0x3d/0xe6
__alloc_pages_nodemask+0x6cd/0x9a0
alloc_pages_current+0x91/0x100
runtest_store+0x296/0xa50
simple_attr_write+0xbd/0xe0
__vfs_write+0x28/0xf0
vfs_write+0xa9/0x1b0
SyS_write+0x46/0xb0
system_call_fastpath+0x16/0x75
This patch fixes this error by moving up set_page_owner().
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kbuild test robot reported the following
tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head: 14a6f1989d
commit: 3b242c66cc x86: mm: enable deferred struct page initialisation on x86-64
date: 3 days ago
config: x86_64-randconfig-x006-201527 (attached as .config)
reproduce:
git checkout 3b242c66cc
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
mm/page_alloc.c: In function 'early_page_uninitialised':
>> mm/page_alloc.c:247:6: warning: unused variable 'nid' [-Wunused-variable]
int nid = early_pfn_to_nid(pfn);
It's due to the NODE_DATA macro ignoring the nid parameter on !NUMA
configurations. This patch avoids the warning by not declaring nid.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred. One approach is to initialise
memory on demand but it interferes with page allocator paths. This patch
creates dedicated threads to initialise memory before basic setup. It
then blocks on a rw_semaphore until completion as a wait_queue and counter
is overkill. This may be slower to boot but it's simplier overall and
also gets rid of a section mangling which existed so kswapd could do the
initialisation.
[akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Waiman Long <waiman.long@hp.com
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Scott Norton <scott.norton@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mminit_verify_page_links() is an extremely paranoid check that was
introduced when memory initialisation was being heavily reworked.
Profiles indicated that up to 10% of parallel memory initialisation was
spent on checking this for every page. The cost could be reduced but in
practice this check only found problems very early during the
initialisation rewrite and has found nothing since. This patch removes an
expensive unnecessary check.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During parallel sturct page initialisation, ranges are checked for every
PFN unnecessarily which increases boot times. This patch alters when the
ranges are checked.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Parallel struct page frees pages one at a time. Try free pages as single
large pages where possible.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Deferred struct page initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only a subset of struct pages are initialised at the moment. When this
patch is applied kswapd initialise the remaining struct pages in parallel.
This should boot faster by spreading the work to multiple CPUs and
initialising data that is local to the CPU. The user-visible effect on
large machines is that free memory will appear to rapidly increase early
in the lifetime of the system until kswapd reports that all memory is
initialised in the kernel log. Once initialised there should be no other
user-visibile effects.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch initalises all low memory struct pages and 2G of the highest
zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
but will be available in a later patch. Parallel initialisation of struct
page depends on some features from memory hotplug and it is necessary to
alter alter section annotations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. As well as
unnecessary visibility, it's unnecessary function call overhead when
initialising pages. This patch moves the helpers inline.
[akpm@linux-foundation.org: fix build]
[mhocko@suse.cz: fix build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__early_pfn_to_nid() use static variables to cache recent lookups as
memblock lookups are very expensive but it assumes that memory
initialisation is single-threaded. Parallel initialisation of struct
pages will break that assumption so this patch makes __early_pfn_to_nid()
SMP-safe by requiring the caller to cache recent search information.
early_pfn_to_nid() keeps the same interface but is only safe to use early
in boot due to the use of a global static variable. meminit_pfn_in_nid()
is an SMP-safe version that callers must maintain their own state for.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__free_pages_bootmem prepares a page for release to the buddy allocator
and assumes that the struct page is initialised. Parallel initialisation
of struct pages defers initialisation and __free_pages_bootmem can be
called for struct pages that cannot yet map struct page to PFN. This
patch passes PFN to __free_pages_bootmem with no other functional change.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently each page struct is set as reserved upon initialization. This
patch leaves the reserved bit clear and only sets the reserved bit when it
is known the memory was allocated by the bootmem allocator. This makes it
easier to distinguish between uninitialised struct pages and reserved
struct pages in later patches.
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and
so this patch breaks up the monolithic function in preparation.
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patchbomb from Andrew Morton:
- a few misc things
- ocfs2 udpates
- kernel/watchdog.c feature work (took ages to get right)
- most of MM. A few tricky bits are held up and probably won't make 4.2.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits)
mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
mm, thp: respect MPOL_PREFERRED policy with non-local node
tmpfs: truncate prealloc blocks past i_size
mm/memory hotplug: print the last vmemmap region at the end of hot add memory
mm/mmap.c: optimization of do_mmap_pgoff function
mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
mm: kmemleak: fix delete_object_*() race when called on the same memory block
mm: kmemleak: allow safe memory scanning during kmemleak disabling
memcg: convert mem_cgroup->under_oom from atomic_t to int
memcg: remove unused mem_cgroup->oom_wakeups
frontswap: allow multiple backends
x86, mirror: x86 enabling - find mirrored memory ranges
mm/memblock: allocate boot time data structures from mirrored memory
mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
mm: do not ignore mapping_gfp_mask in page cache allocation paths
mm/cma.c: fix typos in comments
mm/oom_kill.c: print points as unsigned int
mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
...
The should_alloc_retry() function was meant to encapsulate retry
conditions of the allocator slowpath, but there are still checks
remaining in the main function, and much of how the retrying is
performed also depends on the OOM killer progress. The physical
separation of those conditions make the code hard to follow.
Inline the should_alloc_retry() checks. Notes:
- The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
replace it with looping on OOM killer progress
- The pm_suspended_storage() check is meant to skip the OOM killer
when reclaim has no IO available, move to __alloc_pages_may_oom()
- The order <= PAGE_ALLOC_COSTLY order is re-united with its original
counterpart of checking whether reclaim actually made any progress
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.
The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain. Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.
The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.
However, the OOM killer is a fairly cold error path, there is really no
reason to optimize for highly performant and concurrent OOM kills. And
the oom_sem is just flat-out redundant.
Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's been five years now that KM_* kmap flags have been removed and that
we can call clear_highpage from any context. So we remove prep_zero_pages
accordingly.
Signed-off-by: Anisse Astier <anisse@astier.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For !CONFIG_NUMA, hashdist will always be 0, since it's setter is
otherwise compiled out. So we can save 4 bytes of data and some .text
(although mostly in __init functions) by only defining it for
CONFIG_NUMA.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea it so help make
this maintainable by placing it with other page allocation functions.
Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.
I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a signficant
size reduction by reducing code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since it doesn't work reliably on non-scalar types.
This patch removes the rest of the usages of ACCESS_ONCE, and use the new
READ_ONCE API for the read accesses. This makes things cleaner, instead
of using separate/multiple sets of APIs.
Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.
GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected. It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
allocator slowpath to fail without trying reclaim even though it may be
used in combination with __GFP_WAIT.
An example of the problem this creates: commit e97ca8e5b8 ("mm: fix
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE. The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a
no-op in these cases since the page allocator special-cases
__GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.
It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity. Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.
This allows the aforementioned functions to actually reclaim as they
should. It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.
Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems nobody needs this.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes show_mem() much less verbose on huge machines. Instead of huge
and almost useless dump of counters for each per-zone per-cpu lists this
patch prints the sum of these counters for each zone (free_pcp) and size
of per-cpu list for current cpu (local_pcp).
The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode.
[akpm@linux-foundation.org: update show_free_areas comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction has anti fragmentation algorithm. It is that freepage should
be more than pageblock order to finish the compaction if we don't find any
freepage in requested migratetype buddy list. This is for mitigating
fragmentation, but, there is a lack of migratetype consideration and it is
too excessive compared to page allocator's anti fragmentation algorithm.
Not considering migratetype would cause premature finish of compaction.
For example, if allocation request is for unmovable migratetype, freepage
with CMA migratetype doesn't help that allocation and compaction should
not be stopped. But, current logic regards this situation as compaction
is no longer needed, so finish the compaction.
Secondly, condition is too excessive compared to page allocator's logic.
We can steal freepage from other migratetype and change pageblock
migratetype on more relaxed conditions in page allocator. This is
designed to prevent fragmentation and we can use it here. Imposing hard
constraint only to the compaction doesn't help much in this case since
page allocator would cause fragmentation again.
To solve these problems, this patch borrows anti fragmentation logic from
page allocator. It will reduce premature compaction finish in some cases
and reduce excessive compaction work.
stress-highalloc test in mmtests with non movable order 7 allocation shows
considerable increase of compaction success rate.
Compaction success rate (Compaction success * 100 / Compaction stalls, %)
31.82 : 42.20
I tested it on non-reboot 5 runs stress-highalloc benchmark and found that
there is no more degradation on allocation success rate than before. That
roughly means that this patch doesn't result in more fragmentations.
Vlastimil suggests additional idea that we only test for fallbacks when
migration scanner has scanned a whole pageblock. It looked good for
fragmentation because chance of stealing increase due to making more free
pages in certain pageblock. So, I tested it, but, it results in decreased
compaction success rate, roughly 38.00. I guess the reason that if system
is low memory condition, watermark check could be failed due to not enough
order 0 free page and so, sometimes, we can't reach a fallback check
although migrate_pfn is aligned to pageblock_nr_pages. I can insert code
to cope with this situation but it makes code more complicated so I don't
include his idea at this patch.
[akpm@linux-foundation.org: fix CONFIG_CMA=n build]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is preparation step to use page allocator's anti fragmentation logic
in compaction. This patch just separates fallback freepage checking part
from fallback freepage management part. Therefore, there is no functional
change.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Freepage with MIGRATE_CMA can be used only for MIGRATE_MOVABLE and they
should not be expanded to other migratetype buddy list to protect them
from unmovable/reclaimable allocation. Implementing these requirements in
__rmqueue_fallback(), that is, finding largest possible block of freepage
has bad effect that high order freepage with MIGRATE_CMA are broken
continually although there are suitable order CMA freepage. Reason is
that they are not be expanded to other migratetype buddy list and next
__rmqueue_fallback() invocation try to finds another largest block of
freepage and break it again. So, MIGRATE_CMA fallback should be handled
separately. This patch introduces __rmqueue_cma_fallback(), that just
wrapper of __rmqueue_smallest() and call it before __rmqueue_fallback() if
migratetype == MIGRATE_MOVABLE.
This results in unintended behaviour change that MIGRATE_CMA freepage is
always used first rather than other migratetype as movable allocation's
fallback. But, as already mentioned above, MIGRATE_CMA can be used only
for MIGRATE_MOVABLE, so it is better to use MIGRATE_CMA freepage first as
much as possible. Otherwise, we needlessly take up precious freepages
with other migratetype and increase chance of fragmentation.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa has pointed out that __GFP_NOFAIL allocations might fail
after OOM killer is disabled if the allocation is performed by a kernel
thread. This behavior was introduced from the very beginning by
7f33d49a2e ("mm, PM/Freezer: Disable OOM killer when tasks are frozen").
This means that the basic contract for the allocation request is broken
and the context requesting such an allocation might blow up unexpectedly.
There are basically two ways forward.
1) move oom_killer_disable after kernel threads are frozen. This has a
risk that the OOM victim wouldn't be able to finish because it would
depend on an already frozen kernel thread. This would be really tricky
to debug.
2) do not fail GFP_NOFAIL allocation no matter what and risk a
potential Freezable kernel threads will loop and fail the suspend.
Incidental allocations after kernel threads are frozen will at least
dump a warning - if we are lucky and the serial console is still active
of course...
This patch implements the later option because it is safer. We would see
warning rather than allocation failures for the kernel threads which would
blow up otherwise and have a higher chances to identify __GFP_NOFAIL users
from deeper pm code.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@gooogle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.
Commit 9879de7373 ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point. This creates problems with filesystem callers (?) that currently
rely on the allocator waiting for other tasks to intervene.
Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.
Fixes: 9879de7373 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pulling the code protected by if (system_state == SYSTEM_BOOTING) into
its own helper allows us to shrink .text a little. This relies on
build_all_zonelists already having a __ref annotation. Add a comment
explaining why so one doesn't have to track it down through git log.
The real saving comes in 3/5, ("mm/mm_init.c: Mark mminit_verify_zonelist
as __init"), where we save about 400 bytes
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Cc: Pintu Kumar <pintu.k@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When allocation falls back to stealing free pages of another migratetype,
it can decide to steal extra pages, or even the whole pageblock in order
to reduce fragmentation, which could happen if further allocation
fallbacks pick a different pageblock. In try_to_steal_freepages(), one of
the situations where extra pages are stolen happens when we are trying to
allocate a MIGRATE_RECLAIMABLE page.
However, MIGRATE_UNMOVABLE allocations are not treated the same way,
although spreading such allocation over multiple fallback pageblocks is
arguably even worse than it is for RECLAIMABLE allocations. To minimize
fragmentation, we should minimize the number of such fallbacks, and thus
steal as much as is possible from each fallback pageblock.
Note that in theory this might put more pressure on movable pageblocks and
cause movable allocations to steal back from unmovable pageblocks.
However, movable allocations are not as aggressive with stealing, and do
not cause permanent fragmentation, so the tradeoff is reasonable, and
evaluation seems to support the change.
This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
steal extra free pages. When evaluating with stress-highalloc from
mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
roughly 1/6. The number of these fallbacks stealing from MIGRATE_MOVABLE
block is reduced to 1/3. There was no observation of growing number of
unmovable pageblocks over time, and also not of increased movable
allocation fallbacks.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When allocation falls back to another migratetype, it will steal a page
with highest available order, and (depending on this order and desired
migratetype), it might also steal the rest of free pages from the same
pageblock.
Given the preference of highest available order, it is likely that it will
be higher than the desired order, and result in the stolen buddy page
being split. The remaining pages after split are currently stolen only
when the rest of the free pages are stolen. This can however lead to
situations where for MOVABLE allocations we split e.g. order-4 fallback
UNMOVABLE page, but steal only order-0 page. Then on the next MOVABLE
allocation (which may be batched to fill the pcplists) we split another
order-3 or higher page, etc. By stealing all pages that we have split, we
can avoid further stealing.
This patch therefore adjusts the page stealing so that buddy pages created
by split are always stolen. This has effect only on MOVABLE allocations,
as RECLAIMABLE and UNMOVABLE allocations already always do that in
addition to stealing the rest of free pages from the pageblock. The
change also allows to simplify try_to_steal_freepages() and factor out CMA
handling.
According to Mel, it has been intended since the beginning that buddy
pages after split would be stolen always, but it doesn't seem like it was
ever the case until commit 47118af076 ("mm: mmzone: MIGRATE_CMA
migration type added"). The commit has unintentionally introduced this
behavior, but was reverted by commit 0cbef29a78 ("mm:
__rmqueue_fallback() should respect pageblock type"). Neither included
evaluation.
My evaluation with stress-highalloc from mmtests shows about 2.5x
reduction of page stealing events for MOVABLE allocations, without
affecting the page stealing events for other allocation migratetypes.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages(). The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.
Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks. Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.
The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results. Here I used different baseline which includes RFC compaction
improvements from [1]. I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.
First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test. First
column is after fresh reboot, other two are reiterations of test without
reboot. That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Page alloc extfrag event 10264225 8702233 10244125
Extfrag fragmenting 10263271 8701552 10243473
Extfrag fragmenting for unmovable 13595 17616 15960
Extfrag fragmenting unmovable placed with movable 7989 12193 8447
Extfrag fragmenting for reclaimable 658 1840 1817
Extfrag fragmenting reclaimable placed with movable 558 1677 1679
Extfrag fragmenting for movable 10249018 8682096 10225696
With Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Page alloc extfrag event 11834954 9877523 9774860
Extfrag fragmenting 11833993 9876880 9774245
Extfrag fragmenting for unmovable 7342 16129 11712
Extfrag fragmenting unmovable placed with movable 4191 10547 6270
Extfrag fragmenting for reclaimable 373 1130 923
Extfrag fragmenting reclaimable placed with movable 302 906 738
Extfrag fragmenting for movable 11826278 9859621 9761610
With Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Page alloc extfrag event 4725990 3668793 3807436
Extfrag fragmenting 4725104 3668252 3806898
Extfrag fragmenting for unmovable 6678 7974 7281
Extfrag fragmenting unmovable placed with movable 2051 3829 4017
Extfrag fragmenting for reclaimable 429 1208 1278
Extfrag fragmenting reclaimable placed with movable 369 976 1034
Extfrag fragmenting for movable 4717997 3659070 3798339
With Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Page alloc extfrag event 5016183 4700142 3850633
Extfrag fragmenting 5015325 4699613 3850072
Extfrag fragmenting for unmovable 1312 3154 3088
Extfrag fragmenting unmovable placed with movable 1115 2777 2714
Extfrag fragmenting for reclaimable 437 1193 1097
Extfrag fragmenting reclaimable placed with movable 330 969 879
Extfrag fragmenting for movable 5013576 4695266 3845887
In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise. Here, each patch
improves the situation for unmovable events. Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO. The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless. These are least critical as
compaction can move them around.
If we look at success rates, the patches don't affect them, that didn't change.
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-nothp-1 5-nothp-2 5-nothp-3
Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%)
Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%)
Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%)
Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%)
Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%)
Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%)
Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%)
Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)
Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-nothp-1 6-nothp-2 6-nothp-3
Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%)
Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%)
Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%)
Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%)
Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%)
Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%)
Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%)
Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-nothp-1 7-nothp-2 7-nothp-3
Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%)
Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%)
Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%)
Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%)
Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%)
Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%)
Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%)
Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%)
Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%)
Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-nothp-1 8-nothp-2 8-nothp-3
Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%)
Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%)
Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%)
Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%)
Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%)
Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%)
Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%)
Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%)
Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%)
While there's no improvement here, I consider reduced fragmentation events
to be worth on its own. Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:
Patch 1:
Compaction stalls 4153 3959 3978
Compaction success 1523 1441 1446
Compaction failures 2630 2517 2531
Page migrate success 4600827 4943120 5104348
Page migrate failure 19763 16656 17806
Compaction pages isolated 9597640 10305617 10653541
Compaction migrate scanned 77828948 86533283 87137064
Compaction free scanned 517758295 521312840 521462251
Compaction cost 5503 5932 6110
Patch 2:
Compaction stalls 3800 3450 3518
Compaction success 1421 1316 1317
Compaction failures 2379 2134 2201
Page migrate success 4160421 4502708 4752148
Page migrate failure 19705 14340 14911
Compaction pages isolated 8731983 9382374 9910043
Compaction migrate scanned 98362797 96349194 98609686
Compaction free scanned 496512560 469502017 480442545
Compaction cost 5173 5526 5811
As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.
Configuring the benchmark to allocate like THP page fault (i.e. no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot. This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.
Baseline:
3.19-rc4 3.19-rc4 3.19-rc4
5-thp-1 5-thp-2 5-thp-3
Page alloc extfrag event 8148965 6227815 6646741
Extfrag fragmenting 8147872 6227130 6646117
Extfrag fragmenting for unmovable 10324 12942 15975
Extfrag fragmenting unmovable placed with movable 5972 8495 10907
Extfrag fragmenting for reclaimable 601 1707 2210
Extfrag fragmenting reclaimable placed with movable 520 1570 2000
Extfrag fragmenting for movable 8136947 6212481 6627932
Patch 1:
3.19-rc4 3.19-rc4 3.19-rc4
6-thp-1 6-thp-2 6-thp-3
Page alloc extfrag event 8345457 7574471 7020419
Extfrag fragmenting 8343546 7573777 7019718
Extfrag fragmenting for unmovable 10256 18535 30716
Extfrag fragmenting unmovable placed with movable 6893 11726 22181
Extfrag fragmenting for reclaimable 465 1208 1023
Extfrag fragmenting reclaimable placed with movable 353 996 843
Extfrag fragmenting for movable 8332825 7554034 6987979
Patch 2:
3.19-rc4 3.19-rc4 3.19-rc4
7-thp-1 7-thp-2 7-thp-3
Page alloc extfrag event 3512847 3020756 2891625
Extfrag fragmenting 3511940 3020185 2891059
Extfrag fragmenting for unmovable 9017 6892 6191
Extfrag fragmenting unmovable placed with movable 1524 3053 2435
Extfrag fragmenting for reclaimable 445 1081 1160
Extfrag fragmenting reclaimable placed with movable 375 918 986
Extfrag fragmenting for movable 3502478 3012212 2883708
Patch 3:
3.19-rc4 3.19-rc4 3.19-rc4
8-thp-1 8-thp-2 8-thp-3
Page alloc extfrag event 3181699 3082881 2674164
Extfrag fragmenting 3180812 3082303 2673611
Extfrag fragmenting for unmovable 1201 4031 4040
Extfrag fragmenting unmovable placed with movable 974 3611 3645
Extfrag fragmenting for reclaimable 478 1165 1294
Extfrag fragmenting reclaimable placed with movable 387 985 1030
Extfrag fragmenting for movable 3179133 3077107 2668277
The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1. Anyway, patch 2 rectifies it.
Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.
[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
This patch (of 3):
When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype. With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:
1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)
These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g. distribute UNMOVABLE allocations among multiple
pageblocks.
Originally, decision for 1) has implied the decision for 3). Commit
47118af076 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.
Commit fef903efcf ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended. Commit 0cbef29a78 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1). This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.
This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.
Note that evaluation has shown that the behavior introduced by
47118af076 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again. For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a78.
[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.13+ containing 0cbef29a78]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5695be142e ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter. The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.
Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires the full OOM and freezer's task freezing
exclusion, though. This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.
oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.
oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.
The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.
out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation. The page allocation
path simply fails the allocation same as before. The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.
Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
- the victim cannot loop in the page fault handler (it would die
on the way out from the exception)
- it cannot loop in the page allocator because all the further
allocation would fail and __GFP_NOFAIL allocations are not
acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
(try_to_freeze expects lockless context) and kernel threads and
work queues are not frozen yet
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Especially on 32 bit kernels memory node ranges are printed with 32 bit
wide addresses only. Use u64 types and %llx specifiers to print full
width of addresses.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although it was not called, destroy_compound_page() did some potentially
useful checks. Let's re-introduce them in free_pages_prepare(), where
they can be actually triggered when CONFIG_DEBUG_VM=y.
compound_order() assert is already in free_pages_prepare(). We have few
checks for tail pages left.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only caller is __free_one_page(). By the time we should have
page->flags to be cleared already:
- for 0-order pages though PCP list:
free_hot_cold_page()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
<put the page to PCP list>
free_pcppages_bulk()
page = <withdraw pages from PCP list>
__free_one_page(page)
- for non-0-order pages:
__free_pages_ok()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
free_one_page()
__free_one_page()
So there's no way PageCompound() will return true in __free_one_page().
Let's remove dead destroy_compound_page() and put assert for page->flags
there instead.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Expand the usage of the struct alloc_context introduced in the previous
patch also for calling try_to_compact_pages(), to reduce the number of its
parameters. Since the function is in different compilation unit, we need
to move alloc_context definition in the shared mm/internal.h header.
With this change we get simpler code and small savings of code size and stack
usage:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
function old new delta
__alloc_pages_direct_compact 283 256 -27
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
function old new delta
try_to_compact_pages 582 569 -13
Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
scripts/checkstack.pl).
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce struct alloc_context to accumulate the numerous parameters
passed between the alloc_pages* family of functions and
get_page_from_freelist(). This excludes gfp_flags and alloc_info, which
mutate too much along the way, and allocation order, which is conceptually
different.
The result is shorter function signatures, as well as overal code size and
stack usage reductions.
bloat-o-meter:
add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
function old new delta
get_page_from_freelist 2525 2652 +127
__alloc_pages_direct_compact 329 283 -46
__alloc_pages_nodemask 2564 2300 -264
checkstack.pl:
function old new
__alloc_pages_nodemask 248 200
get_page_from_freelist 168 184
__alloc_pages_direct_compact 40 24
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The possibility of replacing the numerous parameters of alloc_pages*
functions with a single structure has been discussed when Minchan proposed
to expand the x86 kernel stack [1]. This series implements the change,
along with few more cleanups/microoptimizations.
The series is based on next-20150108 and I used gcc 4.8.3 20140627 on
openSUSE 13.2 for compiling. Config includess NUMA and COMPACTION.
The core change is the introduction of a new struct alloc_context, which looks
like this:
struct alloc_context {
struct zonelist *zonelist;
nodemask_t *nodemask;
struct zone *preferred_zone;
int classzone_idx;
int migratetype;
enum zone_type high_zoneidx;
};
All the contents is mostly constant, except that __alloc_pages_slowpath()
changes preferred_zone, classzone_idx and potentially zonelist. But
that's not a problem in case control returns to retry_cpuset: in
__alloc_pages_nodemask(), those will be reset to initial values again
(although it's a bit subtle). On the other hand, gfp_flags and alloc_info
mutate so much that it doesn't make sense to put them into alloc_context.
Still, the result is one parameter instead of up to 7. This is all in
Patch 2.
Patch 3 is a step to expand alloc_context usage out of page_alloc.c
itself. The function try_to_compact_pages() can also much benefit from
the parameter reduction, but it means the struct definition has to be
moved to a shared header.
Patch 1 should IMHO be included even if the rest is deemed not useful
enough. It improves maintainability and also has some code/stack
reduction. Patch 4 is OTOH a tiny optimization.
Overall bloat-o-meter results:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-460 (-460)
function old new delta
nr_free_zone_pages 129 115 -14
__alloc_pages_direct_compact 329 256 -73
get_page_from_freelist 2670 2576 -94
__alloc_pages_nodemask 2564 2285 -279
try_to_compact_pages 582 579 -3
Overall stack sizes per ./scripts/checkstack.pl:
old new delta
get_page_from_freelist: 184 184 0
__alloc_pages_nodemask 248 200 -48
__alloc_pages_direct_c 40 - -40
try_to_compact_pages 72 72 0
-88
[1] http://marc.info/?l=linux-mm&m=140142462528257&w=2
This patch (of 4):
prep_new_page() sets almost everything in the struct page of the page
being allocated, except page->pfmemalloc. This is not obvious and has at
least once led to a bug where page->pfmemalloc was forgotten to be set
correctly, see commit 8fb74b9fb2 ("mm: compaction: partially revert
capture of suitable high-order page").
This patch moves the pfmemalloc setting to prep_new_page(), which means it
needs to gain alloc_flags parameter. The call to prep_new_page is moved
from buffered_rmqueue() to get_page_from_freelist(), which also leads to
simpler code. An obsolete comment for buffered_rmqueue() is replaced.
In addition to better maintainability there is a small reduction of code
and stack usage for get_page_from_freelist(), which inlines the other
functions involved.
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
function old new delta
get_page_from_freelist 2670 2525 -145
Stack usage is reduced from 184 to 168 bytes.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now kmemcheck_pagealloc_alloc() is only called by __alloc_pages_slowpath().
__alloc_pages_nodemask()
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()
And the page will not be tracked by kmemcheck in the following path.
__alloc_pages_nodemask()
get_page_from_freelist()
So move kmemcheck_pagealloc_alloc() into __alloc_pages_nodemask(),
like this:
__alloc_pages_nodemask()
...
get_page_from_freelist()
if (!page)
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()
...
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_nodemask() strips __GFP_IO when retrying the page
allocation. But it does this by altering the function-wide variable
gfp_mask. This will cause subsequent allocation attempts to inadvertently
use the modified gfp_mask.
Also, pass the correct mask (the mask we actually used) into
trace_mm_page_alloc().
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the freeing page and its buddy page are not at the same zone, the
current holding zone->lock for the freeing page cann't prevent buddy page
getting allocated, this could trigger VM_BUG_ON_PAGE in page_is_buddy() at
a very tiny chance, such as:
cpu 0: cpu 1:
hold zone_1 lock
check page and it buddy
PageBuddy(buddy) is true hold zone_2 lock
page_order(buddy) == order is true alloc buddy
trigger VM_BUG_ON_PAGE(page_count(buddy) != 0)
zone_1->lock prevents the freeing page getting allocated
zone_2->lock prevents the buddy page getting allocated
they are not the same zone->lock.
If we can't remove the zone_id check statement, it's better handle this
rare race. This patch fixes this by placing the zone_id check before the
VM_BUG_ON_PAGE check.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The OOM killing invocation does a lot of duplicative checks against the
task's allocation context. Rework it to take advantage of the existing
checks in the allocator slowpath.
The OOM killer is invoked when the allocator is unable to reclaim any
pages but the allocation has to keep looping. Instead of having a check
for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM
invocation to the true branch of should_alloc_retry(). The __GFP_FS
check from oom_gfp_allowed() can then be moved into the OOM avoidance
branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test.
__alloc_pages_may_oom() can then signal to the caller whether the OOM
killer was invoked, instead of requiring it to duplicate the order and
high_zoneidx checks to guess this when deciding whether to continue.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the system boots up, in the dmesg logs we can see the memory
statistics along with total reserved as below. Memory: 458840k/458840k
available, 65448k reserved, 0K highmem
When CMA is enabled, still the total reserved memory remains the same.
However, the CMA memory is not considered as reserved. But, when we see
/proc/meminfo, the CMA memory is part of free memory. This creates
confusion. This patch corrects the problem by properly subtracting the
CMA reserved memory from the total reserved memory in dmesg logs.
Below is the dmesg snapshot from an arm based device with 512MB RAM and
12MB single CMA region.
Before this change:
Memory: 458840k/458840k available, 65448k reserved, 0K highmem
After this change:
Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem
Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 01cefaef40 ("mm: provide more accurate estimation
of pages occupied by memmap") allocate the pages from lowmem for the
highmem zones' memmap. So It is not need to reserver the memmap's for
the highmem.
A 2G DDR3 for the arm platform:
On node 0 totalpages: 524288
free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
DMA zone: 3568 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 456704 pages, LIFO batch:31
HighMem zone: 528 pages used for memmap
HighMem zone: 67584 pages, LIFO batch:15
On node 0 totalpages: 524288
free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
DMA zone: 3568 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 456704 pages, LIFO batch:31
HighMem zone: 67584 pages, LIFO batch:15
Signed-off-by: Hongbo Zhong <hongbo.zhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages. The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.
Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.
Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does. Accumulate the number over all
visited lruvecs to get the per-zone value.
Some nodes have multiple zones due to memory addressing restrictions. To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.
For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim. It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.
This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.
Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.
[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the page owner tracking code which is introduced so far ago. It
is resident on Andrew's tree, though, nobody tried to upstream so it
remain as is. Our company uses this feature actively to debug memory leak
or to find a memory hogger so I decide to upstream this feature.
This functionality help us to know who allocates the page. When
allocating a page, we store some information about allocation in extra
memory. Later, if we need to know status of all pages, we can get and
analyze it from this stored information.
In previous version of this feature, extra memory is statically defined in
struct page, but, in this version, extra memory is allocated outside of
struct page. It enables us to turn on/off this feature at boottime
without considerable memory waste.
Although we already have tracepoint for tracing page allocation/free,
using it to analyze page owner is rather complex. We need to enlarge the
trace buffer for preventing overlapping until userspace program launched.
And, launched program continually dump out the trace buffer for later
analysis and it would change system behaviour with more possibility rather
than just keeping it in memory, so bad for debug.
Moreover, we can use page_owner feature further for various purposes. For
example, we can use it for fragmentation statistics implemented in this
patch. And, I also plan to implement some CMA failure debugging feature
using this interface.
I'd like to give the credit for all developers contributed this feature,
but, it's not easy because I don't know exact history. Sorry about that.
Below is people who has "Signed-off-by" in the patches in Andrew's tree.
Contributor:
Alexander Nyberg <alexn@dsv.su.se>
Mel Gorman <mgorman@suse.de>
Dave Hansen <dave@linux.vnet.ibm.com>
Minchan Kim <minchan@kernel.org>
Michal Nazarewicz <mina86@mina86.com>
Andrew Morton <akpm@linux-foundation.org>
Jungsoo Son <jungsoo.son@lge.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, we have prepared to avoid using debug-pagealloc in boottime. So
introduce new kernel-parameter to disable debug-pagealloc in boottime, and
makes related functions to be disabled in this case.
Only non-intuitive part is change of guard page functions. Because guard
page is effective only if debug-pagealloc is enabled, turning off
according to debug-pagealloc is reasonable thing to do.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Until now, debug-pagealloc needs extra flags in struct page, so we need to
recompile whole source code when we decide to use it. This is really
painful, because it takes some time to recompile and sometimes rebuild is
not possible due to third party module depending on struct page. So, we
can't use this good feature in many cases.
Now, we have the page extension feature that allows us to insert extra
flags to outside of struct page. This gets rid of third party module
issue mentioned above. And, this allows us to determine if we need extra
memory for this page extension in boottime. With these property, we can
avoid using debug-pagealloc in boottime with low computational overhead in
the kernel built with CONFIG_DEBUG_PAGEALLOC. This will help our
development process greatly.
This patch is the preparation step to achive above goal. debug-pagealloc
originally uses extra field of struct page, but, after this patch, it will
use field of struct page_ext. Because memory for page_ext is allocated
later than initialization of page allocator in CONFIG_SPARSEMEM, we should
disable debug-pagealloc feature temporarily until initialization of
page_ext. This patch implements this.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we debug something, we'd like to insert some information to every
page. For this purpose, we sometimes modify struct page itself. But,
this has drawbacks. First, it requires re-compile. This makes us
hesitate to use the powerful debug feature so development process is
slowed down. And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency. At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel. Keeping this as it is would be better to reproduce errornous
situation.
This feature is intended to overcome above mentioned problems. This
feature allocates memory for extended data per page in certain place
rather than the struct page itself. This memory can be accessed by the
accessor functions provided by this code. During the boot process, it
checks whether allocation of huge chunk of memory is needed or not. If
not, it avoids allocating memory at all. With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.
Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature, so
this patch resurrect it.
To help these things to work well, this patch introduces two callbacks for
clients. One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time. The other is optional, init
callback, which is used to do proper initialization after memory is
allocated. Detailed explanation about purpose of these functions is in
code comment. Please refer it.
Others are completely same with previous extension code in memcg.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page guard is used by debug-pagealloc feature. Currently, it is
open-coded, but, I think that more abstraction of it makes core page
allocator code more readable.
There is no functional difference.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup update from Tejun Heo:
"cpuset got simplified a bit. cgroup core got a fix on unified
hierarchy and grew some effective css related interfaces which will be
used for blkio support for writeback IO traffic which is currently
being worked on"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: implement cgroup_get_e_css()
cgroup: add cgroup_subsys->css_e_css_changed()
cgroup: add cgroup_subsys->css_released()
cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
cpuset: lock vs unlock typo
cpuset: simplify cpuset_node_allowed API
cpuset: convert callback_mutex to a spinlock
Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.
There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.
text data bss dec hex filename
8828345 1725264 983040 11536649 b00909 vmlinux.old
8827425 1725264 966656 11519345 afc571 vmlinux.new
[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goal of memory compaction is to create high-order freepages through
page migration. Page migration however puts pages on the per-cpu lru_add
cache, which is later flushed to per-cpu pcplists, and only after pcplists
are drained the pages can actually merge. This can happen due to the
per-cpu caches becoming full through further freeing, or explicitly.
During direct compaction, it is useful to do the draining explicitly so
that pages merge as soon as possible and compaction can detect success
immediately and keep the latency impact at minimum. However the current
implementation is far from ideal. Draining is done only in
__alloc_pages_direct_compact(), after all zones were already compacted,
and the decisions to continue or stop compaction in individual zones was
done without the last batch of migrations being merged. It is also
missing the draining of lru_add cache before the pcplists.
This patch moves the draining for direct compaction into compact_zone().
It adds the missing lru_cache draining and uses the newly introduced
single zone pcplists draining to reduce overhead and avoid impact on
unrelated zones. Draining is only performed when it can actually lead to
merging of a page of desired order (passed by cc->order). This means it
is only done when migration occurred in the previously scanned cc->order
aligned block(s) and the migration scanner is now pointing to the next
cc->order aligned block.
The patch has been tested with stress-highalloc benchmark from mmtests.
Although overal allocation success rates of the benchmark were not
affected, the number of detected compaction successes has doubled. This
suggests that allocations were previously successful due to implicit
merging caused by background activity, making a later allocation attempt
succeed immediately, but not attributing the success to compaction. Since
stress-highalloc always tries to allocate almost the whole memory, it
cannot show the improvement in its reported success rate metric. However
after this patch, compaction should detect success and terminate earlier,
reducing the direct compaction latencies in a real scenario.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 53853e2d2b ("mm, compaction: defer each zone individually
instead of preferred zone"), compaction is deferred for each zone where
sync direct compaction fails, and reset where it succeeds. However, it
was observed that for DMA zone compaction often appeared to succeed
while subsequent allocation attempt would not, due to different outcome
of watermark check.
In order to properly defer compaction in this zone, the candidate zone
has to be passed back to __alloc_pages_direct_compact() and compaction
deferred in the zone after the allocation attempt fails.
The large source of mismatch between watermark check in compaction and
allocation was the lack of alloc_flags and classzone_idx values in
compaction, which has been fixed in the previous patch. So with this
problem fixed, we can simplify the code by removing the candidate_zone
parameter and deferring in __alloc_pages_direct_compact().
After this patch, the compaction activity during stress-highalloc
benchmark is still somewhat increased, but it's negligible compared to the
increase that occurred without the better watermark checking. This
suggests that it is still possible to apparently succeed in compaction but
fail to allocate, possibly due to parallel allocation activity.
[akpm@linux-foundation.org: fix build]
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction relies on zone watermark checks for decisions such as if it's
worth to start compacting in compaction_suitable() or whether compaction
should stop in compact_finished(). The watermark checks take
classzone_idx and alloc_flags parameters, which are related to the memory
allocation request. But from the context of compaction they are currently
passed as 0, including the direct compaction which is invoked to satisfy
the allocation request, and could therefore know the proper values.
The lack of proper values can lead to mismatch between decisions taken
during compaction and decisions related to the allocation request. Lack
of proper classzone_idx value means that lowmem_reserve is not taken into
account. This has manifested (during recent changes to deferred
compaction) when DMA zone was used as fallback for preferred Normal zone.
compaction_suitable() without proper classzone_idx would think that the
watermarks are already satisfied, but watermark check in
get_page_from_freelist() would fail. Because of this problem, deferring
compaction has extra complexity that can be removed in the following
patch.
The issue (not confirmed in practice) with missing alloc_flags is opposite
in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
CMA-enabled systems) the watermark checking in compaction with 0 passed
will be stricter than in get_page_from_freelist(). In these cases
compaction might be running for a longer time than is really needed.
Another issue compaction_suitable() is that the check for "does the zone
need compaction at all?" comes only after the check "does the zone have
enough free free pages to succeed compaction". The latter considers extra
pages for migration and can therefore in some situations fail and return
COMPACT_SKIPPED, although the high-order allocation would succeed and we
should return COMPACT_PARTIAL.
This patch fixes these problems by adding alloc_flags and classzone_idx to
struct compact_control and related functions involved in direct compaction
and watermark checking. Where possible, all other callers of
compaction_suitable() pass proper values where those are known. This is
currently limited to classzone_idx, which is sometimes known in kswapd
context. However, the direct reclaim callers should_continue_reclaim()
and compaction_ready() do not currently know the proper values, so the
coordination between reclaim and compaction may still not be as accurate
as it could. This can be fixed later, if it's shown to be an issue.
Additionaly the checks in compact_suitable() are reordered to address the
second issue described above.
The effect of this patch should be slightly better high-order allocation
success rates and/or less compaction overhead, depending on the type of
allocations and presence of CMA. It allows simplifying deferred
compaction code in a followup patch.
When testing with stress-highalloc, there was some slight improvement
(which might be just due to variance) in success rates of non-THP-like
allocations.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows us to catch the bug fixed in the previous patch (mm: free
compound page with correct order).
Here we also verify whether a page is tail page or not -- tail pages are
supposed to be freed along with their head, not by themselves.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CMA allocation drains pcplists so that pages can merge back to buddy
allocator. Since it operates on a single zone, we can reduce the
pcplists drain to the single zone, which is now possible.
The change should make CMA allocations faster and not disturbing
unrelated pcplists anymore.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones. There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.
This patch introduces new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to be also a zone
pointer. When NULL is passed, the functions operate on all zones as
usual. Passing a specific zone pointer reduces the work to the single
zone.
All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One thing I did in this patch is fixing freepage accounting. If we
clear guard page and link it onto isolate buddy list, we should not
increase freepage count. This patch adds conditional branch to skip
counting in this case. Without this patch, this overcounting happens
frequently if guard order is set and CMA is used.
Another thing fixed in this patch is the target to reset order. In
__free_one_page(), we check the buddy page whether it is a guard page or
not. And, if so, we should clear guard attribute on the buddy page and
reset order of it to 0. But, current code resets original page's order
rather than buddy one's. Maybe, this doesn't have any problem, because
whole merged page's order will be re-assigned soon. But, it is better
to correct code.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Having test_pages_isolated failure message as a warning confuses users
into thinking that it is more serious than it really is. In reality, if
called via CMA, allocation will be retried so a single
test_pages_isolated failure does not prevent allocation from succeeding.
Demote the warning message to an info message and reformat it such that
the text "failed" does not appear and instead a less worrying "PFNS
busy" is used.
This message is trivially reproducible on a 10GB x86 machine on 3.16.y
kernels configured with CONFIG_DMA_CMA.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current pageblock isolation logic could isolate each pageblock
individually. This causes freepage accounting problem if freepage with
pageblock order on isolate pageblock is merged with other freepage on
normal pageblock. We can prevent merging by restricting max order of
merging to pageblock order if freepage is on isolate pageblock.
A side-effect of this change is that there could be non-merged buddy
freepage even if finishing pageblock isolation, because undoing
pageblock isolation is just to move freepage from isolate buddy list to
normal buddy list rather than to consider merging. So, the patch also
makes undoing pageblock isolation consider freepage merge. When
un-isolation, freepage with more than pageblock order and it's buddy are
checked. If they are on normal pageblock, instead of just moving, we
isolate the freepage and free it in order to get merged.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the caller of __free_one_page() has similar freepage counting logic,
so we can move it to __free_one_page(). This reduce line of code and
help future maintenance.
This is also preparation step for "mm/page_alloc: restrict max order of
merging on isolated pageblock" which fix the freepage counting problem
on freepage with more than pageblock order.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In free_pcppages_bulk(), we use cached migratetype of freepage to
determine type of buddy list where freepage will be added. This
information is stored when freepage is added to pcp list, so if
isolation of pageblock of this freepage begins after storing, this
cached information could be stale. In other words, it has original
migratetype rather than MIGRATE_ISOLATE.
There are two problems caused by this stale information.
One is that we can't keep these freepages from being allocated.
Although this pageblock is isolated, freepage will be added to normal
buddy list so that it could be allocated without any restriction. And
the other problem is incorrect freepage accounting. Freepages on
isolate pageblock should not be counted for number of freepage.
Following is the code snippet in free_pcppages_bulk().
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
if (likely(!is_migrate_isolate_page(page))) {
__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
}
As you can see above snippet, current code already handle second
problem, incorrect freepage accounting, by re-fetching pageblock
migratetype through is_migrate_isolate_page(page).
But, because this re-fetched information isn't used for
__free_one_page(), first problem would not be solved. This patch try to
solve this situation to re-fetch pageblock migratetype before
__free_one_page() and to use it for __free_one_page().
In addition to move up position of this re-fetch, this patch use
optimization technique, re-fetching migratetype only if there is isolate
pageblock. Pageblock isolation is rare event, so we can avoid
re-fetching in common case with this optimization.
This patch also correct migratetype of the tracepoint output.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before describing bugs itself, I first explain definition of freepage.
1. pages on buddy list are counted as freepage.
2. pages on isolate migratetype buddy list are *not* counted as freepage.
3. pages on cma buddy list are counted as CMA freepage, too.
Now, I describe problems and related patch.
Patch 1: There is race conditions on getting pageblock migratetype that
it results in misplacement of freepages on buddy list, incorrect
freepage count and un-availability of freepage.
Patch 2: Freepages on pcp list could have stale cached information to
determine migratetype of buddy list to go. This causes misplacement of
freepages on buddy list and incorrect freepage count.
Patch 4: Merging between freepages on different migratetype of
pageblocks will cause freepages accouting problem. This patch fixes it.
Without patchset [3], above problem doesn't happens on my CMA allocation
test, because CMA reserved pages aren't used at all. So there is no
chance for above race.
With patchset [3], I did simple CMA allocation test and get below
result:
- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) on background
- 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
- Result: more than 5000 freepage count are missed
With patchset [3] and this patchset, I found that no freepage count are
missed so that I conclude that problems are solved.
On my simple memory offlining test, these problems also occur on that
environment, too.
This patch (of 4):
There are two paths to reach core free function of buddy allocator,
__free_one_page(), one is free_one_page()->__free_one_page() and the
other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
Each paths has race condition causing serious problems. At first, this
patch is focused on first type of freepath. And then, following patch
will solve the problem in second type of freepath.
In the first type of freepath, we got migratetype of freeing page
without holding the zone lock, so it could be racy. There are two cases
of this race.
1. pages are added to isolate buddy list after restoring orignal
migratetype
CPU1 CPU2
get migratetype => return MIGRATE_ISOLATE
call free_one_page() with MIGRATE_ISOLATE
grab the zone lock
unisolate pageblock
release the zone lock
grab the zone lock
call __free_one_page() with MIGRATE_ISOLATE
freepage go into isolate buddy list,
although pageblock is already unisolated
This may cause two problems. One is that we can't use this page anymore
until next isolation attempt of this pageblock, because freepage is on
isolate buddy list. The other is that freepage accouting could be wrong
due to merging between different buddy list. Freepages on isolate buddy
list aren't counted as freepage, but ones on normal buddy list are
counted as freepage. If merge happens, buddy freepage on normal buddy
list is inevitably moved to isolate buddy list without any consideration
of freepage accouting so it could be incorrect.
2. pages are added to normal buddy list while pageblock is isolated.
It is similar with above case.
This also may cause two problems. One is that we can't keep these
freepages from being allocated. Although this pageblock is isolated,
freepage would be added to normal buddy list so that it could be
allocated without any restriction. And the other problem is same as
case 1, that it, incorrect freepage accouting.
This race condition would be prevented by checking migratetype again
with holding the zone lock. Because it is somewhat heavy operation and
it isn't needed in common case, we want to avoid rechecking as much as
possible. So this patch introduce new variable, nr_isolate_pageblock in
struct zone to check if there is isolated pageblock. With this, we can
avoid to re-check migratetype in common case and do it only if there is
isolated pageblock or migratetype is MIGRATE_ISOLATE. This solve above
mentioned problems.
Changes from v3:
Add one more check in free_one_page() that checks whether migratetype is
MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Heesub Shin <heesub.shin@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.
Such a distinction was introduced by commit 02a0e53d82 ("cpuset:
rework cpuset_zone_allowed api"). Before, we had the only version with
the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.
However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.
So let's simplify the API back to the single check.
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.
Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.
Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.
Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael
Fixes: f660daac47 (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull x86 mm updates from Ingo Molnar:
"This tree includes the following changes:
- fix memory hotplug
- fix hibernation bootup memory layout assumptions
- fix hyperv numa guest kernel messages
- remove dead code
- update documentation"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Update memory map description to list hypervisor-reserved area
x86/mm, hibernate: Do not assume the first e820 area to be RAM
x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data()
x86/mm/hotplug: Modify PGD entry when removing memory
x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable()
x86: Remove set_pmd_pfn
dump_page() and dump_vma() are not specific to page_alloc.c, move them out
so page_alloc.c won't turn into the unofficial debug repository.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zones are allocated by the page allocator in either node or zone order.
Node ordering is preferred in terms of locality and is applied
automatically in one of three cases:
1. If a node has only low memory
2. If DMA/DMA32 is a high percentage of memory
3. If low memory on a single node is greater than 70% of the node size
Otherwise zone ordering is used to preserve low memory for devices that
require it. Unfortunately a consequence of this is that applications
running on a machine with balanced NUMA nodes will experience different
performance characteristics depending on which node they happen to start
from.
The point of zone ordering is to protect lower zones for devices that
require DMA/DMA32 memory. When NUMA was first introduced, this was
critical as 32-bit NUMA machines existed and exhausting low memory
triggered OOMs easily as so many allocations required low memory. On
64-bit machines the primary concern is devices that are 32-bit only which
is less severe than the low memory exhaustion problem on 32-bit NUMA. It
seems there are really few devices that depends on it.
AGP -- I assume this is getting more rare but even then I think the allocations
happen early in boot time where lowmem pressure is less of a problem
DRM -- If the device is 32-bit only then there may be low pressure. I didn't
evaluate these in detail but it looks like some of these are mobile
graphics card. Not many NUMA laptops out there. DRM folk should know
better though.
Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
B43 wireless card -- again not really a NUMA thing.
I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
machines in case someone throws a brain damanged TV or graphics card in there.
This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
to make it default everywhere but I understand that some embedded arches may
be using 32-bit NUMA where I cannot predict the consequences.
The performance impact depends on the workload and the characteristics of the
machine and the machine I tested on had a large Normal zone on node 0 so the
impact is within the noise for the majority of tests. The allocation stats
show more allocation requests were from DMA32 and local node. Running SpecJBB
with multiple JVMs and automatic NUMA balancing disabled the results were
specjbb
3.17.0-rc2 3.17.0-rc2
vanilla nodeorder-v1r1
Min 1 29534.00 ( 0.00%) 30020.00 ( 1.65%)
Min 10 115717.00 ( 0.00%) 134038.00 ( 15.83%)
Min 19 109718.00 ( 0.00%) 114186.00 ( 4.07%)
Min 28 104459.00 ( 0.00%) 103639.00 ( -0.78%)
Min 37 98245.00 ( 0.00%) 103756.00 ( 5.61%)
Min 46 97198.00 ( 0.00%) 96197.00 ( -1.03%)
Mean 1 30953.25 ( 0.00%) 31917.75 ( 3.12%)
Mean 10 124432.50 ( 0.00%) 140904.00 ( 13.24%)
Mean 19 116033.50 ( 0.00%) 119294.75 ( 2.81%)
Mean 28 108365.25 ( 0.00%) 106879.50 ( -1.37%)
Mean 37 102984.75 ( 0.00%) 106924.25 ( 3.83%)
Mean 46 100783.25 ( 0.00%) 105368.50 ( 4.55%)
Stddev 1 1260.38 ( 0.00%) 1109.66 ( 11.96%)
Stddev 10 7434.03 ( 0.00%) 5171.91 ( 30.43%)
Stddev 19 8453.84 ( 0.00%) 5309.59 ( 37.19%)
Stddev 28 4184.55 ( 0.00%) 2906.63 ( 30.54%)
Stddev 37 5409.49 ( 0.00%) 3192.12 ( 40.99%)
Stddev 46 4521.95 ( 0.00%) 7392.52 (-63.48%)
Max 1 32738.00 ( 0.00%) 32719.00 ( -0.06%)
Max 10 136039.00 ( 0.00%) 148614.00 ( 9.24%)
Max 19 130566.00 ( 0.00%) 127418.00 ( -2.41%)
Max 28 115404.00 ( 0.00%) 111254.00 ( -3.60%)
Max 37 112118.00 ( 0.00%) 111732.00 ( -0.34%)
Max 46 108541.00 ( 0.00%) 116849.00 ( 7.65%)
TPut 1 123813.00 ( 0.00%) 127671.00 ( 3.12%)
TPut 10 497730.00 ( 0.00%) 563616.00 ( 13.24%)
TPut 19 464134.00 ( 0.00%) 477179.00 ( 2.81%)
TPut 28 433461.00 ( 0.00%) 427518.00 ( -1.37%)
TPut 37 411939.00 ( 0.00%) 427697.00 ( 3.83%)
TPut 46 403133.00 ( 0.00%) 421474.00 ( 4.55%)
3.17.0-rc2 3.17.0-rc2
vanillanodeorder-v1r1
DMA allocs 0 0
DMA32 allocs 57 1491992
Normal allocs 32543566 30026383
Movable allocs 0 0
Direct pages scanned 0 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 0 0
Kswapd efficiency 100% 100%
Kswapd velocity 0.000 0.000
Direct efficiency 100% 100%
Direct velocity 0.000 0.000
Percentage direct scans 0% 0%
Zone normal velocity 0.000 0.000
Zone dma32 velocity 0.000 0.000
Zone dma velocity 0.000 0.000
THP fault alloc 55164 52987
THP collapse alloc 139 147
THP splits 26 21
NUMA alloc hit 4169066 4250692
NUMA alloc miss 0 0
Note that there were more DMA32 allocations with the patch applied. In this
particular case there was no difference in numa_hit and numa_miss. The
expectation is that DMA32 was being used at the low watermark instead of
falling into the slow path. kswapd was not woken but it's not worken for
THP allocations.
On 32-bit, this patch defaults to zone-ordering as low memory depletion
can be a serious problem on 32-bit large memory machines. If the default
ordering was node then processes on node 0 will deplete the Normal zone
due to normal activity. The problem is worse if CONFIG_HIGHPTE is not
set. If combined with large amounts of dirty/writeback pages in Normal
zone then there is also a high risk of OOM. The heuristics are removed
as it's not clear they were ever important on 32-bit. They were only
relevant for setting node-ordering on 64-bit.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 2.6.24 there has been a paranoid check in move_freepages that looks
up the zone of two pages. This is a very slow path and the only time I've
seen this bug trigger recently is when memory initialisation was broken
during patch development. Despite the fact it's a slow path, this patch
converts the check to a VM_BUG_ON anyway as it has served its purpose by
now.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
reader through layers indirection just to track down a simple bit.
Remove all zone flag wrappers and just use bitops against zone->flags
directly. It's just as readable and the lines are barely any longer.
Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
remove the zone_flags_t typedef.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When entering the page_alloc slowpath, we wakeup kswapd on every pgdat
according to the zonelist and high_zoneidx. However, this doesn't take
nodemask into account, and could prematurely wakeup kswapd on some
unintended nodes.
This patch uses for_each_zone_zonelist_nodemask() instead of
for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
situation.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a helper to dump information about a VMA, this also makes
dump_page_flags more generic and re-uses that so the output looks very
similar to dump_page:
[ 61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
[ 61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
[ 61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops (null)
[ 61.903437] pgoff 7ffffffdd file (null) private_data (null)
[ 61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)
[akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
[swarren@nvidia.com: fix dump_vma() compilation]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
ALLOC_CPUSET) that have separate semantics.
The function allocflags_to_migratetype() actually takes gfp flags, not
alloc flags, and returns a migratetype. Rename it to
gfpflags_to_migratetype().
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Async compaction aborts when it detects zone lock contention or
need_resched() is true. David Rientjes has reported that in practice,
most direct async compactions for THP allocation abort due to
need_resched(). This means that a second direct compaction is never
attempted, which might be OK for a page fault, but khugepaged is intended
to attempt a sync compaction in such case and in these cases it won't.
This patch replaces "bool contended" in compact_control with an int that
distinguishes between aborting due to need_resched() and aborting due to
lock contention. This allows propagating the abort through all compaction
functions as before, but passing the abort reason up to
__alloc_pages_slowpath() which decides when to continue with direct
reclaim and another compaction attempt.
Another problem is that try_to_compact_pages() did not act upon the
reported contention (both need_resched() or lock contention) immediately
and would proceed with another zone from the zonelist. When
need_resched() is true, that means initializing another zone compaction,
only to check again need_resched() in isolate_migratepages() and aborting.
For zone lock contention, the unintended consequence is that the lock
contended status reported back to the allocator is detrmined from the last
zone where compaction was attempted, which is rather arbitrary.
This patch fixes the problem in the following way:
- async compaction of a zone aborting due to need_resched() or fatal signal
pending means that further zones should not be tried. We report
COMPACT_CONTENDED_SCHED to the allocator.
- aborting zone compaction due to lock contention means we can still try
another zone, since it has different set of locks. We report back
COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
it was aborted due to lock contention.
As a result of these fixes, khugepaged will proceed with second sync
compaction as intended, when the preceding async compaction aborted due to
need_resched(). Page fault compactions aborting due to need_resched()
will spare some cycles previously wasted by initializing another zone
compaction only to abort again. Lock contention will be reported only
when compaction in all zones aborted due to lock contention, and therefore
it's not a good idea to try again after reclaim.
In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
has improved number of THP collapse allocations by 10%, which shows
positive effect on khugepaged. The benchmark's success rates are
unchanged as it is not recognized as khugepaged. Numbers of compact_stall
and compact_fail events have however decreased by 20%, with
compact_success still a bit improved, which is good. With benchmark
configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
collapse allocations, and only slight improvement in stalls and failures.
[akpm@linux-foundation.org: fix warnings]
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
isolate_migratepages_range() is the main function of the compaction
scanner, called either on a single pageblock by isolate_migratepages()
during regular compaction, or on an arbitrary range by CMA's
__alloc_contig_migrate_range(). It currently perfoms two pageblock-wide
compaction suitability checks, and because of the CMA callpath, it tracks
if it crossed a pageblock boundary in order to repeat those checks.
However, closer inspection shows that those checks are always true for CMA:
- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
- migrate_async_suitable() check is skipped because CMA uses sync compaction
We can therefore move the compaction-specific checks to
isolate_migratepages() and simplify isolate_migratepages_range().
Furthermore, we can mimic the freepage scanner family of functions, which
has isolate_freepages_block() function called both by compaction from
isolate_freepages() and by CMA from isolate_freepages_range(), where each
use-case adds own specific glue code. This allows further code
simplification.
Thus, we rename isolate_migratepages_range() to
isolate_migratepages_block() and limit its functionality to a single
pageblock (or its subset). For CMA, a new different
isolate_migratepages_range() is created as a CMA-specific wrapper for the
_block() function. The checks specific to compaction are moved to
isolate_migratepages(). As part of the unification of these two families
of functions, we remove the redundant zone parameter where applicable,
since zone pointer is already passed in cc->zone.
Furthermore, going back to compact_zone() and compact_finished() when
pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
- the checks are meant to skip pageblocks quickly. The patch therefore
also introduces a simple loop into isolate_migratepages() so that it does
not return immediately on failed pageblock checks, but keeps going until
isolate_migratepages_range() gets called once. Similarily to
isolate_freepages(), the function periodically checks if it needs to
reschedule or abort async compaction.
[iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compact_stall vmstat counter counts the number of allocations stalled
by direct compaction. It does not count when all attempted zones had
deferred compaction, but it does count when all zones skipped compaction.
The skipping is decided based on very early check of
compaction_suitable(), based on watermarks and memory fragmentation.
Therefore it makes sense not to count skipped compactions as stalls.
Moreover, compact_success or compact_fail is also already not being
counted when compaction was skipped, so this patch changes the
compact_stall counting to match the other two.
Additionally, restructure __alloc_pages_direct_compact() code for better
readability.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When direct sync compaction is often unsuccessful, it may become deferred
for some time to avoid further useless attempts, both sync and async.
Successful high-order allocations un-defer compaction, while further
unsuccessful compaction attempts prolong the compaction deferred period.
Currently the checking and setting deferred status is performed only on
the preferred zone of the allocation that invoked direct compaction. But
compaction itself is attempted on all eligible zones in the zonelist, so
the behavior is suboptimal and may lead both to scenarios where 1)
compaction is attempted uselessly, or 2) where it's not attempted despite
good chances of succeeding, as shown on the examples below:
1) A direct compaction with Normal preferred zone failed and set
deferred compaction for the Normal zone. Another unrelated direct
compaction with DMA32 as preferred zone will attempt to compact DMA32
zone even though the first compaction attempt also included DMA32 zone.
In another scenario, compaction with Normal preferred zone failed to
compact Normal zone, but succeeded in the DMA32 zone, so it will not
defer compaction. In the next attempt, it will try Normal zone which
will fail again, instead of skipping Normal zone and trying DMA32
directly.
2) Kswapd will balance DMA32 zone and reset defer status based on
watermarks looking good. A direct compaction with preferred Normal
zone will skip compaction of all zones including DMA32 because Normal
was still deferred. The allocation might have succeeded in DMA32, but
won't.
This patch makes compaction deferring work on individual zone basis
instead of preferred zone. For each zone, it checks compaction_deferred()
to decide if the zone should be skipped. If watermarks fail after
compacting the zone, defer_compaction() is called. The zone where
watermarks passed can still be deferred when the allocation attempt is
unsuccessful. When allocation is successful, compaction_defer_reset() is
called for the zone containing the allocated page. This approach should
approximate calling defer_compaction() only on zones where compaction was
attempted and did not yield allocated page. There might be corner cases
but that is inevitable as long as the decision to stop compacting dues not
guarantee that a page will be allocated.
Due to a new COMPACT_DEFERRED return value, some functions relying
implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
more accurate. The did_some_progress output parameter of
__alloc_pages_direct_compact() is removed completely, as the caller
actually does not use it after compaction sets it - it is only considered
when direct reclaim sets it.
During testing on a two-node machine with a single very small Normal zone
on node 1, this patch has improved success rates in stress-highalloc
mmtests benchmark. The success here were previously made worse by commit
3a025760fc ("mm: page_alloc: spill to remote nodes before waking
kswapd") as kswapd was no longer resetting often enough the deferred
compaction for the Normal zone, and DMA32 zones on both nodes were thus
not considered for compaction. On different machine, success rates were
improved with __GFP_NO_KSWAPD allocations.
[akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
from gfp_mask in each retry pass, although the migratetype variable
already has the value determined and it does not change. Use the variable
and perform the check only once. Also convert #ifdef CONFIG_CMA to
IS_ENABLED.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
on ppc LPARs with memoryless nodes, a large amount of memory was consumed
by slabs and was marked unreclaimable. He tracked it down to slab
deactivations in the SLUB core when we allocate remotely, leading to poor
efficiency always when memoryless nodes are present.
After much discussion, Joonsoo provided a few patches that help
significantly. They don't resolve the problem altogether:
- memory hotplug still needs testing, that is when a memoryless node
becomes memory-ful, we want to dtrt
- there are other reasons for going off-node than memoryless nodes,
e.g., fully exhausted local nodes
Neither case is resolved with this series, but I don't think that should
block their acceptance, as they can be explored/resolved with follow-on
patches.
The series consists of:
[1/3] topology: add support for node_to_mem_node() to determine the
fallback node
[2/3] slub: fallback to node_to_mem_node() node if allocating on
memoryless node
- Joonsoo's patches to cache the nearest node with memory for each
NUMA node
[3/3] Partial revert of 81c98869fa (""kthread: ensure locality of
task_struct allocations")
- At Tejun's request, keep the knowledge of memoryless node fallback
to the allocator core.
This patch (of 3):
We need to determine the fallback node in slub allocator if the allocation
target node is memoryless node. Without it, the SLUB wrongly select the
node which has no memory and can't use a partial slab, because of node
mismatch. Introduced function, node_to_mem_node(X), will return a node Y
with memory that has the nearest distance. If X is memoryless node, it
will return nearest distance node, but, if X is normal node, it will
return itself.
We will use this function in following patch to determine the fallback
node.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zone allocation batches can easily underflow due to higher-order
allocations or spills to remote nodes. On SMP that's fine, because
underflows are expected from concurrency and dealt with by returning 0.
But on UP, zone_page_state will just return a wrapped unsigned long,
which will get past the <= 0 check and then consider the zone eligible
until its watermarks are hit.
Commit 3a025760fc ("mm: page_alloc: spill to remote nodes before
waking kswapd") already made the counter-resetting use
atomic_long_read() to accomodate underflows from remote spills, but it
didn't go all the way with it.
Make it clear that these batches are expected to go negative regardless
of concurrency, and use atomic_long_read() everywhere.
Fixes: 81c0a2bb51 ("mm: page_alloc: fair zone allocator policy")
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The setup_node_data() function allocates a pg_data_t object,
inserts it into the node_data[] array and initializes the
following fields: node_id, node_start_pfn and
node_spanned_pages.
However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
setup_node_data() is not used.
This causes a small glitch when running Linux as a hyperv numa
guest:
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
Initmem setup node 0 [mem 0x00000000-0x7fffffff]
NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
Initmem setup node 1 [mem 0x80800000-0x1081fffff]
NODE_DATA [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7
In this dmesg, the SRAT table reports that the memory range for
node 1 starts at 0x80200000. However, the line starting with
"Initmem" reports that node 1 memory range starts at 0x80800000.
The "Initmem" line is reported by setup_node_data() and is
wrong, because the kernel ends up using the range as reported in
the SRAT table.
This commit drops all that dead code from setup_node_data(),
renames it to alloc_node_data() and adds a printk() to
free_area_init_node() so that we report a node's memory range
accurately.
Here's the same dmesg section with this patch applied:
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
Initmem setup node 1 [mem 0x80200000-0x1081fffff]
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7
This commit was tested on a two node bare-metal NUMA machine and
Linux as a numa guest on hyperv and qemu/kvm.
PS: The wrong memory range reported by setup_node_data() seems to be
harmless in the current kernel because it's just not used. However,
that bad range is used in kernel 2.6.32 to initialize the old boot
memory allocator, which causes a crash during boot.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
or not, has gained more users. Their use is not necessarily wrong, they
are trying to do a memory allocation that can easily fail without
disturbing kswapd, so the bit has gained additional usecases.
This restructures the check to determine whether MIGRATE_SYNC_LIGHT
should be used for memory compaction in the page allocator. Rather than
testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
for thp allocations.
This also moves the check to be done only after the page allocator is
aborted for deferred or contended memory compaction since setting
migration_mode for this case is pointless.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
to imply that they require locking semantics to avoid out_of_memory()
being reordered.
zone_scan_lock is required for both functions to ensure that there is
proper locking synchronization.
Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
locking semantics.
At the same time, convert oom_zonelist_trylock() to return bool instead
of int since only success and failure are tested.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.
One assumption made with this scheme is that batches expire at roughly
the same time and the resets each time are justified. This assumption
does not hold when zones reach their low watermark as the batches will
be consumed at uneven rates. Allocation failure due to watermark
depletion result in additional zonelist scans for the reset and another
watermark check before hitting the slowpath.
On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes. The system CPU overhead comparison looks like
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanilla vmstat-v5 lowercost-v5
User 746.94 774.56 802.00
System 65336.22 32847.27 40852.33
Elapsed 27553.52 27415.04 27368.46
However it is worth noting that the overall benchmark still completed
faster and intuitively it makes sense to take as few passes as possible
through the zonelists.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of numa_zonelist_order=zone is to preserve lower zones for
use with 32-bit devices. If locality is preferred then the
numa_zonelist_order=node policy should be used.
Unfortunately, the fair zone allocation policy overrides this by
skipping zones on remote nodes until the lower one is found. While this
makes sense from a page aging and performance perspective, it breaks the
expected zonelist policy. This patch restores the expected behaviour
for zone-list ordering.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free. Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.
On a small UMA machine running tiobench the difference is marginal. On
a 4-node machine the overhead is more noticable. Note that automatic
NUMA balancing was disabled for this test as otherwise the system CPU
overhead is unpredictable.
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5 vmstat-v5
User 746.94 759.78 774.56
System 65336.22 58350.98 32847.27
Elapsed 27553.52 27282.02 27415.04
Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on.
On x86-64 for example
o The zone->node field is shared with the zone lock and zone->node is
accessed frequently from the page allocator due to the fair zone
allocation policy.
o span_seqlock is almost never used by shares a line with free_area
o Some zone statistics share a cache line with the LRU lock so
reclaim-intensive and allocator-intensive workloads can bounce the cache
line on a stat update
This patch rearranges struct zone to put read-only and read-mostly
fields together and then splits the page allocator intensive fields, the
zone statistics and the page reclaim intensive fields into their own
cache lines. Note that the type of lowmem_reserve changes due to the
watermark calculations being signed and avoiding a signed/unsigned
conversion there.
On the test configuration I used the overall size of struct zone shrunk
by one cache line. On smaller machines, this is not likely to be
noticable. However, on a 4-node NUMA machine running tiobench the
system CPU overhead is reduced by this patch.
3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5r9
User 746.94 759.78
System 65336.22 58350.98
Elapsed 27553.52 27282.02
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields. With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.
This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of open-coding getting minimal value of two, just use min macro.
That is why it is there for. While changing the function also change
type of batch local variable to match type of per_cpu_pages::batch
(which is int).
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is only called by mm/page_cgroup.c whcih cannot be modular.
Reported-by: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.
Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.
This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings and function name in mm/page_alloc.c:
Warning(..//mm/page_alloc.c:6074): No description found for parameter 'pfn'
Warning(..//mm/page_alloc.c:6074): No description found for parameter 'mask'
Warning(..//mm/page_alloc.c:6074): Excess function parameter 'start_bitidx' description in 'get_pfnblock_flags_mask'
Warning(..//mm/page_alloc.c:6102): No description found for parameter 'pfn'
Warning(..//mm/page_alloc.c:6102): No description found for parameter 'mask'
Warning(..//mm/page_alloc.c:6102): Excess function parameter 'start_bitidx' description in 'set_pfnblock_flags_mask'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
the following is triggered at early boot:
SMP: Total of 8 processors activated.
devtmpfs: initialized
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = fffffe0000050000
[00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
PC is at __list_add+0x10/0xd4
LR is at free_one_page+0x270/0x638
...
Call trace:
__list_add+0x10/0xd4
free_one_page+0x26c/0x638
__free_pages_ok.part.52+0x84/0xbc
__free_pages+0x74/0xbc
init_cma_reserved_pageblock+0xe8/0x104
cma_init_reserved_areas+0x190/0x1e4
do_one_initcall+0xc4/0x154
kernel_init_freeable+0x204/0x2a8
kernel_init+0xc/0xd4
This happens because init_cma_reserved_pageblock() calls
__free_one_page() with pageblock_order as page order but it is bigger
than MAX_ORDER. This in turn causes accesses past zone->free_list[].
Fix the problem by changing init_cma_reserved_pageblock() such that it
splits pageblock into individual MAX_ORDER pages if pageblock is bigger
than a MAX_ORDER page.
In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
architectures expect for ia64, powerpc and tile at the moment, the
âpageblock_order > MAX_ORDERâ condition will be optimised out since both
sides of the operator are constants. In cases where pageblock size is
variable, the performance degradation should not be significant anyway
since init_cma_reserved_pageblock() is called only at boot time at most
MAX_CMA_AREAS times which by default is eight.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Mark Salter <msalter@redhat.com>
Tested-by: Christopher Covington <cov@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:
divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6
However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.
This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.
This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add_active_range() has been repalced by memblock_set_node(). Clean up the
comments to comply with that change.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to calculate zone_idx(preferred_zone) multiple times
or use the pgdat to figure it out.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type if used for page order.
This converts a number of sites in mm/page_alloc.c to use unsigned int for
order where possible.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_pageblock_migratetype() is called during free with IRQs disabled.
This is unnecessary and disables IRQs for longer than necessary.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The test_bit operations in get/set pageblock flags are expensive. This
patch reads the bitmap on a word basis and use shifts and masks to isolate
the bits of interest. Similarly masks are used to set a local copy of the
bitmap and then use cmpxchg to update the bitmap if there have been no
other changes made in parallel.
In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.
In addition to the performance benefits, this patch closes races that are
possible between:
a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
reads part of the bits before and other part of the bits after
set_pageblock_migratetype() has updated them.
b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
read-modify-update set bit operation in set_pageblock_skip() will cause
lost updates to some bits changed in the set_pageblock_migratetype().
Joonsoo Kim first reported the case a) via code inspection. Vlastimil
Babka's testing with a debug patch showed that either a) or b) occurs
roughly once per mmtests' stress-highalloc benchmark (although not
necessarily in the same pageblock). Furthermore during development of
unrelated compaction patches, it was observed that frequent calls to
{start,undo}_isolate_page_range() the race occurs several thousands of
times and has resulted in NULL pointer dereferences in move_freepages()
and free_one_page() in places where free_list[migratetype] is
manipulated by e.g. list_move(). Further debugging confirmed that
migratetype had invalid value of 6, causing out of bounds access to the
free_list array.
That confirmed that the race exist, although it may be extremely rare,
and currently only fatal where page isolation is performed due to
memory hot remove. Races on pageblocks being updated by
set_pageblock_migratetype(), where both old and new migratetype are
lower MIGRATE_RESERVE, currently cannot result in an invalid value
being observed, although theoretically they may still lead to
unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
Furthermore, things could get suddenly worse when memory isolation is
used more, or when new migratetypes are added.
After this patch, the race has no longer been observed in testing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
__GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
cases are relatively rare events but the ALLOC_NO_WATERMARK check is an
unlikely branch in the fast path. This patch moves the check out of the
fast path and after it has been determined that the watermarks have not
been met. This helps the common fast path at the cost of making the slow
path slower and hitting kswapd with a performance cost. It's a reasonable
tradeoff.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently it's calculated once per zone in the zonelist.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A node/zone index is used to check if pages are compatible for merging
but this happens unconditionally even if the buddy page is not free. Defer
the calculation as long as possible. Ideally we would check the zone boundary
but nodes can overlap.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a zone cannot be used for a dirty page then it gets marked "full" which
is cached in the zlc and later potentially skipped by allocation requests
that have nothing to do with dirty zones.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zlc is used on NUMA machines to quickly skip over zones that are full.
However it is always updated, even for the first zone scanned when the
zlc might not even be active. As it's a write to a bitmap that
potentially bounces cache line it's deceptively expensive and most
machines will not care. Only update the zlc if it was active.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Synchronous memory compaction can be very expensive: it can iterate an
enormous amount of memory without aborting, constantly rescheduling,
waiting on page locks and lru_lock, etc, if a pageblock cannot be
defragmented.
Unfortunately, it's too expensive for transparent hugepage page faults and
it's much better to simply fallback to pages. On 128GB machines, we find
that synchronous memory compaction can take O(seconds) for a single thp
fault.
Now that async compaction remembers where it left off without strictly
relying on sync compaction, this makes thp allocations best-effort without
causing egregious latency during fault. We still need to retry async
compaction after reclaim, but this won't stall for seconds.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to want to manipulate the migration mode for compaction in the
page allocator, and currently compact_control's sync field is only a bool.
Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
depending on the value of this bool. Convert the bool to enum
migrate_mode and pass the migration mode in directly. Later, we'll want
to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
avoid unnecessary latency.
This also alters compaction triggered from sysfs, either for the entire
system or for a node, to force MIGRATE_SYNC.
[akpm@linux-foundation.org: fix build]
[iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.
This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.
If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody outputs memory addresses in decimal. PFNs are essentially
addresses, and they're gibberish in decimal. Output them in hex.
Also, add the nid and zone name to give a little more context to the
message.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the MIGRATE_RESERVE pages, it is useful when they do not get
misplaced on free_list of other migratetype, otherwise they might get
allocated prematurely and e.g. fragment the MIGRATE_RESEVE pageblocks.
While this cannot be avoided completely when allocating new
MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should
prevent the misplacement where possible.
Currently, it is possible for the misplacement to happen when a
MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
fallback for other desired migratetype, and then later freed back
through free_pcppages_bulk() without being actually used. This happens
because free_pcppages_bulk() uses get_freepage_migratetype() to choose
the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
the *desired* migratetype and not the page's original MIGRATE_RESERVE
migratetype.
This patch fixes the problem by moving the call to
set_freepage_migratetype() from rmqueue_bulk() down to
__rmqueue_smallest() and __rmqueue_fallback() where the actual page's
migratetype (e.g. from which free_list the page is taken from) is used.
Note that this migratetype might be different from the pageblock's
migratetype due to freepage stealing decisions. This is OK, as page
stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
to leave all MIGRATE_CMA pages on the correct freelist.
Therefore, as an additional benefit, the call to
get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
be removed completely. This relies on the fact that MIGRATE_CMA
pageblocks are created only during system init, and the above. The
related is_migrate_isolate() check is also unnecessary, as memory
isolation has other ways to move pages between freelists, and drain pcp
lists containing pages that should be isolated. The buffered_rmqueue()
can also benefit from calling get_freepage_migratetype() instead of
get_pageblock_migratetype().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
by zone_reclaim due to its distance. As it is expected that
zone_reclaim_mode will be rarely enabled it is unreasonable for all
machines to take a penalty. Fortunately, the zone_reclaim_mode() path
is already slow and it is the path that takes the hit.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.
Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so lets have a
sensible default for the bulk of users.
This patch (of 2):
zone_reclaim_mode causes processes to prefer reclaiming memory from
local node instead of spilling over to other nodes. This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes. The NUMA penalties were
sufficiently high to justify reclaiming the memory. On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this. Favour the
common case and disable it by default. Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently to allocate a page that should be charged to kmemcg (e.g.
threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
allocated is then to be freed by free_memcg_kmem_pages. Apart from
looking asymmetrical, this also requires intrusion to the general
allocation path. So let's introduce separate functions that will
alloc/free pages charged to kmemcg.
The new functions are called alloc_kmem_pages and free_kmem_pages. They
should be used when the caller actually would like to use kmalloc, but
has to fall back to the page allocator for the allocation is large.
They only differ from alloc_pages and free_pages in that besides
allocating or freeing pages they also charge them to the kmem resource
counter of the current memory cgroup.
[sfr@canb.auug.org.au: export kmalloc_order() to modules]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new dump_page() routine was recently added, and marked
EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE()
macro, and so the end result is that non-GPL code can no longer call
get_page() and a few other routines.
This only happens if the kernel was compiled with CONFIG_DEBUG_VM.
Change dump_page() to be EXPORT_SYMBOL.
Longer explanation:
Prior to commit 309381feae ("mm: dump page when hitting a VM_BUG_ON
using VM_BUG_ON_PAGE") , it was possible to build MIT-licensed (non-GPL)
drivers on Fedora. Fedora is semi-unique, in that it sets
CONFIG_VM_DEBUG.
Because Fedora sets CONFIG_VM_DEBUG, they end up pulling in dump_page(),
via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's
new, open source, "UVM-Lite" kernel module, I originally choose to use
the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
because get_page() has always seemed to be very clearly intended for use
by non-GPL, driver code.
So I'm hoping that making get_page() widely accessible again will not be
too controversial. We did check with Fedora first, and they responded
(https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
try to get upstream changed, before asking Fedora to change. Their
reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
Fedora to help catch mm bugs.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a small cleanup.
Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On NUMA systems, a node may start thrashing cache or even swap anonymous
pages while there are still free pages on remote nodes.
This is a result of commits 81c0a2bb51 ("mm: page_alloc: fair zone
allocator policy") and fff4068cba ("mm: page_alloc: revert NUMA aspect
of fair allocation policy").
Before those changes, the allocator would first try all allowed zones,
including those on remote nodes, before waking any kswapds. But now,
the allocator fastpath doubles as the fairness pass, which in turn can
only consider the local node to prevent remote spilling based on
exhausted fairness batches alone. Remote nodes are only considered in
the slowpath, after the kswapds are woken up. But if remote nodes still
have free memory, kswapd should not be woken to rebalance the local node
or it may thrash cash or swap prematurely.
Fix this by adding one more unfair pass over the zonelist that is
allowed to spill to remote nodes after the local fairness pass fails but
before entering the slowpath and waking the kswapds.
This also gets rid of the GFP_THISNODE exemption from the fairness
protocol because the unfair pass is no longer tied to kswapd, which
GFP_THISNODE is not allowed to wake up.
However, because remote spills can be more frequent now - we prefer them
over local kswapd reclaim - the allocation batches on remote nodes could
underflow more heavily. When resetting the batches, use
atomic_long_read() directly instead of zone_page_state() to calculate the
delta as the latter filters negative counter values.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
warning:
warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
Let's convert 'reason' to 'const char *' in dump_page() and friends: we
shouldn't modify it anyway.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of swap free and a tons of anonymous memory
which could be swapped out. In the end it turned out that the OOM was a
side effect of zone reclaim which wasn't unmapping and swapping out and
so the system was pushed to the OOM. Although this sounds like a bug
somewhere in the kswapd vs. zone reclaim vs. direct reclaim
interaction numactl on the said hardware suggests that the zone reclaim
should not have been set in the first place:
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10
So all the CPUs are associated with Node0 which doesn't have any memory
while Node2 contains all the available memory. Node distances cause an
automatic zone_reclaim_mode enabling.
Zone reclaim is intended to keep the allocations local but this doesn't
make any sense on the memoryless nodes. So let's exclude such nodes for
init_zone_allows_reclaim which evaluates zone reclaim behavior and
suitable reclaim_nodes.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.
Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.
This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Stancek reports manual page migration encountering allocation
failures after some pages when there is still plenty of memory free, and
bisected the problem down to commit 81c0a2bb51 ("mm: page_alloc: fair
zone allocator policy").
The problem is that GFP_THISNODE obeys the zone fairness allocation
batches on one hand, but doesn't reset them and wake kswapd on the other
hand. After a few of those allocations, the batches are exhausted and
the allocations fail.
Fixing this means either having GFP_THISNODE wake up kswapd, or
GFP_THISNODE not participating in zone fairness at all. The latter
seems safer as an acute bugfix, we can clean up later.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).
This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.
This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.
This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
min_free_kbytes may be raised during THP's initialization. Sometimes,
this will change the value which was set by the user. Showing this
message will clarify this confusion.
Only show this message when changing a value which was set by the user
according to Michal Hocko's suggestion.
Show the old value of min_free_kbytes according to Dave Hansen's
suggestion. This will give user the chance to restore old value of
min_free_kbytes.
Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang. Changing
proc_dointvec() to proc_dointvec_minmax() in the
min_free_kbytes_sysctl_handler() can prevent this to happen.
mhocko said:
: You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make
: your machine unusable but I agree that proc_dointvec_minmax is more
: suitable here as we already have:
:
: .proc_handler = min_free_kbytes_sysctl_handler,
: .extra1 = &zero,
:
: It used to work properly but then 6fce56ec91 ("sysctl: Remove references
: to ctl_name and strategy from the generic sysctl table") has removed
: sysctl_intvec strategy and so extra1 is ignored.
Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.
I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.
This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.
[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether ->index or ->mapping is required to be NULL.
This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places. It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.
This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page
flag combination which was the problem.
[akpm@linux-foundation.org: switch to pr_alert]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.
Luckily, nothing currently does such craziness. So instead of causing
such allocations to loop (potentially forever), we maintain the current
behavior and also warn about the new users of the deprecated flag.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there are several functions to manipulate the deferred
compaction state variables. The remaining case where the variables are
touched directly is when a successful allocation occurs in direct
compaction, or is expected to be successful in the future by kswapd.
Here, the lowest order that is expected to fail is updated, and in the
case of successful allocation, the deferred status and counter is reset
completely.
Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code. No functional
change.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch to memblock interfaces for early memory allocator instead of
bootmem allocator. No functional change in beahvior than what it is in
current code from bootmem users points of view.
Archs already converted to NO_BOOTMEM now directly use memblock
interfaces instead of bootmem wrappers build on top of memblock. And
the archs which still uses bootmem, these new apis just fallback to
exiting bootmem APIs.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If users specify the original movablecore=nn@ss boot option, the kernel
will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
option is similar except it specifies ZONE_NORMAL ranges.
Now, if users specify "movable_node" in kernel commandline, the kernel
will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users
do this, all the other movablecore=nn@ss and kernelcore=nn@ss options
should be ignored.
For those who don't want this, just specify nothing. The kernel will
act as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4b59e6c473 ("mm, show_mem: suppress page counts in
non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to
suppress PFN walks on large memory machines. Commit c78e93630d ("mm:
do not walk all of system memory during show_mem") avoided a PFN walk in
the generic show_mem helper which removes the requirement for
SHOW_MEM_FILTER_PAGE_COUNT in that case.
This patch removes PFN walkers from the arch-specific implementations
that report on a per-node or per-zone granularity. ARM and unicore32
still do a PFN walk as they report memory usage on each bank which is a
much finer granularity where the debugging information may still be of
use. As the remaining arches doing PFN walks have relatively small
amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.
[akpm@linux-foundation.org: fix parisc]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: James Bottomley <jejb@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
9TB memory machine since onlining memory sections is too slow. And we
found out setup_zone_migrate_reserve spent >90% of the time.
The problem is, setup_zone_migrate_reserve scans all pageblocks
unconditionally, but it is only necessary if the number of reserved
block was reduced (i.e. memory hot remove).
Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
that the number of reserved pageblocks is almost always unchanged.
This patch adds zone->nr_migrate_reserve_block to maintain the number of
MIGRATE_RESERVE pageblocks and it reduces the overhead of
setup_zone_migrate_reserve dramatically. The following table shows time
of onlining a memory section.
Amount of memory | 128GB | 192GB | 256GB|
---------------------------------------------
linux-3.12 | 23.9 | 31.4 | 44.5 |
This patch | 8.3 | 8.3 | 8.6 |
Mel's proposal patch | 10.9 | 19.2 | 31.3 |
---------------------------------------------
(millisecond)
128GB : 4 nodes and each node has 32GB of memory
192GB : 6 nodes and each node has 32GB of memory
256GB : 8 nodes and each node has 32GB of memory
(*1) Mel proposed his idea by the following threads.
https://lkml.org/lkml/2013/10/30/272
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 81c0a2bb51 ("mm: page_alloc: fair zone allocator policy") meant
to bring aging fairness among zones in system, but it was overzealous
and badly regressed basic workloads on NUMA systems.
Due to the way kswapd and page allocator interacts, we still want to
make sure that all zones in any given node are used equally for all
allocations to maximize memory utilization and prevent thrashing on the
highest zone in the node.
While the same principle applies to NUMA nodes - memory utilization is
obviously improved by spreading allocations throughout all nodes -
remote references can be costly and so many workloads prefer locality
over memory utilization. The original change assumed that
zone_reclaim_mode would be a good enough predictor for that, but it
turned out to be as indicative as a coin flip.
Revert the NUMA aspect of the fairness until we can find a proper way to
make it configurable and agree on a sane default.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> # 3.12
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 73f038b863. The NUMA behaviour of this patch is
less than ideal. An alternative approch is to interleave allocations
only within local zones which is implemented in the next patch.
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
commit 81c0a2bb51 ("mm: page_alloc: fair zone allocator policy").
That change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator and
slab.
The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone. It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.
Only round-robin allocations that are usually freed through page reclaim
or slab shrinking.
Bisected by Dave Hansen.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When __rmqueue_fallback() doesn't find a free block with the required size
it splits a larger page and puts the rest of the page onto the free list.
But it has one serious mistake. When putting back, __rmqueue_fallback()
always use start_migratetype if type is not CMA. However,
__rmqueue_fallback() is only called when all of the start_migratetype
queue is empty. That said, __rmqueue_fallback always puts back memory to
the wrong queue except try_to_steal_freepages() changed pageblock type
(i.e. requested size is smaller than half of page block). The end result
is that the antifragmentation framework increases fragmenation instead of
decreasing it.
Mel's original anti fragmentation does the right thing. But commit
47118af076 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.
This patch restores sane and old behavior. It also removes an incorrect
comment which was introduced by commit fef903efcf ("mm/page_alloc.c:
restructure free-page stealing code and fix a bug").
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In general, every tracepoint should be zero overhead if it is disabled.
However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate
"new_type == start_migratetype" even if tracepoint is disabled.
However, the code can be moved into tracepoint's TP_fast_assign() and
TP_fast_assign exist exactly such purpose. This patch does it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
the argument to MIGRATE_UNMOVABLE and we lost these attribute.
The problem was introduced by commit 49255c619f ("page allocator: move
check for disabled anti-fragmentation out of fastpath"). So a 4 year
old issue may mean that nobody uses page_group_by_mobility_disabled.
But anyway, this patch fixes the problem.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use helper function to check if we need to deal with oom condition.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use "if (zone->present_pages)" instead of "if (zone->present_pages)".
Simplify the code, no functional change.
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the per page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to try and lookup the alternate task more
easily. Note that even though it is the cpu that is store in the page
flags that the mpol_misplaced decision is still based on the node.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
[ Fixed build failure on 32-bit systems. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.
To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality in the general case. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.
To detect private accesses the pid of the last accessing task is required
but the storage requirements are a high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.
First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.
The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.
The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.
Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix complication error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit cea27eb2a2 ("mm/memory-hotplug: fix lowmem count
overflow when offline pages").
The fixed bug by commit cea27eb was fixed to another way by commit
3dcc0571cd ("mm: correctly update zone->managed_pages"). That commit
enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
memory, for details please refer to:
http://marc.info/?l=linux-mm&m=136957578620221&w=2
As a result, commit cea27eb2a2 currently causes duplicated decreasing
of totalhigh_pages, thus the revert.
Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set _mapcount PAGE_BUDDY_MAPCOUNT_VALUE to make the page buddy. Not the
magic number -2.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is based on KOSAKI's work and I add a little more description,
please refer https://lkml.org/lkml/2012/6/14/74.
Currently, I found system can enter a state that there are lots of free
pages in a zone but only order-0 and order-1 pages which means the zone is
heavily fragmented, then high order allocation could make direct reclaim
path's long stall(ex, 60 seconds) especially in no swap and no compaciton
enviroment. This problem happened on v3.4, but it seems issue still lives
in current tree, the reason is do_try_to_free_pages enter live lock:
kswapd will go to sleep if the zones have been fully scanned and are still
not balanced. As kswapd thinks there's little point trying all over again
to avoid infinite loop. Instead it changes order from high-order to
0-order because kswapd think order-0 is the most important. Look at
73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep
and may leave zone->all_unreclaimable =3D 0. It assume high-order users
can still perform direct reclaim if they wish.
Direct reclaim continue to reclaim for a high order which is not a
COSTLY_ORDER without oom-killer until kswapd turn on
zone->all_unreclaimble= . This is because to avoid too early oom-kill.
So it means direct_reclaim depends on kswapd to break this loop.
In worst case, direct-reclaim may continue to page reclaim forever when
kswapd sleeps forever until someone like watchdog detect and finally kill
the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737
We can't turn on zone->all_unreclaimable from direct reclaim path because
direct reclaim path don't take any lock and this way is racy. Thus this
patch removes zone->all_unreclaimable field completely and recalculates
zone reclaimable state every time.
Note: we can't take the idea that direct-reclaim see zone->pages_scanned
directly and kswapd continue to use zone->all_unreclaimable. Because, it
is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
zone->all_unreclaimable as a name) describes the detail.
[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Neil Zhang <zhangwm@marvell.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpuset_zone_allowed is changed to cpuset_zone_allowed_softwall and the
comment is moved to __cpuset_node_allowed_softwall. So fix this comment.
Signed-off-by: SeungHun Lee <waydi1@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current early_pfn_to_nid() on arch that support memblock go over
memblock.memory one by one, so will take too many try near the end.
We can use existing memblock_search to find the node id for given pfn,
that could save some time on bigger system that have many entries
memblock.memory array.
Here are the timing differences for several machines. In each case with
the patch less time was spent in __early_pfn_to_nid().
3.11-rc5 with patch difference (%)
-------- ---------- --------------
UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%)
UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%)
UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%)
UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%)
Time in seconds.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Until now we can't offline memory blocks which contain hugepages because a
hugepage is considered as an unmovable page. But now with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.
What's different from other users of hugepage migration is that we need to
decompose all the hugepages inside the target memory block into free buddy
pages after hugepage migration, because otherwise free hugepages remaining
in the memory block intervene the memory offlining. For this reason we
introduce new functions dissolve_free_huge_page() and
dissolve_free_huge_pages().
Other than that, what this patch does is straightforwardly to add hugepage
migration code, that is, adding hugepage code to the functions which scan
over pfn and collect hugepages to be migrated, and adding a hugepage
allocation function to alloc_migrate_target().
As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
over them because it's larger than memory block. So we now simply leave
it to fail as it is.
[yongjun_wei@trendmicro.com.cn: remove duplicated include]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
Simplify the code, no functional change.
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main idea behind this patchset is to reduce the vmstat update overhead
by avoiding interrupt enable/disable and the use of per cpu atomics.
This patch (of 3):
It is better to have a separate folding function because
refresh_cpu_vm_stats() also does other things like expire pages in the
page allocator caches.
If we have a separate function then refresh_cpu_vm_stats() is only called
from the local cpu which allows additional optimizations.
The folding function is only called when a cpu is being downed and
therefore no other processor will be accessing the counters. Also
simplifies synchronization.
[akpm@linux-foundation.org: fix UP build]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow
path. For helping compiler optimization, add unlikely macro to
ALLOC_NO_WATERMARKS checking.
This patch doesn't have any effect now, because gcc already optimize this
properly. But we cannot assume that gcc always does right and nobody
re-evaluate if gcc do proper optimization with their change, for example,
it is not optimized properly on v3.10. So adding compiler hint here is
reasonable.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.
But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.
The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.
Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able stay in memory long enough to get referenced twice and
activated, but activation happens in spades:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021
Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim is now fairly spread out to all
available/allowable zones:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937
When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocations that do not have to respect the watermarks are rare
high-priority events. Reorder the code such that per-zone dirty limits
and future checks important only to regular page allocations are ignored
in these extraordinary situations.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not check loop+1 with loop end in loop body. Just duplicate two
lines code to avoid it.
That will help a bit when we have huge amount of pages on system with
16TiB memory.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the current code, the value of fallback_migratetype that is printed
using the mm_page_alloc_extfrag tracepoint, is the value of the
migratetype *after* it has been set to the preferred migratetype (if the
ownership was changed). Obviously that wouldn't have been the original
intent. (We already have a separate 'change_ownership' field to tell
whether the ownership of the pageblock was changed from the
fallback_migratetype to the preferred type.)
The intent of the fallback_migratetype field is to show the migratetype
from which we borrowed pages in order to satisfy the allocation request.
So fix the code to print that value correctly.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The free-page stealing code in __rmqueue_fallback() is somewhat hard to
follow, and has an incredible amount of subtlety hidden inside!
First off, there is a minor bug in the reporting of change-of-ownership of
pageblocks. Under some conditions, we try to move upto
'pageblock_nr_pages' no. of pages to the preferred allocation list. But
we change the ownership of that pageblock to the preferred type only if we
manage to successfully move atleast half of that pageblock (or if
page_group_by_mobility_disabled is set).
However, the current code ignores the latter part and sets the
'migratetype' variable to the preferred type, irrespective of whether we
actually changed the pageblock migratetype of that block or not. So, the
page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
'change_ownership' might be shown as 1 when it must have been 0).
So fixing this involves moving the update of the 'migratetype' variable to
the right place. But looking closer, we observe that the 'migratetype'
variable is used subsequently for checks such as "is_migrate_cma()".
Obviously the intent there is to check if the *fallback* type is
MIGRATE_CMA, but since we already set the 'migratetype' variable to
start_migratetype, we end up checking if the *preferred* type is
MIGRATE_CMA!!
To make things more interesting, this actually doesn't cause a bug in
practice, because we never change *anything* if the fallback type is CMA.
So, restructure the code in such a way that it is trivial to understand
what is going on, and also fix the above mentioned bug. And while at it,
also add a comment explaining the subtlety behind the migratetype used in
the call to expand().
[akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix all errors reported by checkpatch and some small spelling mistakes.
Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
set_pageblock_order() may be called when memory hotplug, so need use
'__paginginit' instead of '__init'.
The related warning:
The function __meminit .free_area_init_node() references
a function __init .set_pageblock_order().
If .set_pageblock_order is only used by .free_area_init_node then
annotate .set_pageblock_order with a matching annotation.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative. The
previous calculating here will generate an unexpected result. In
addition, if PAGE_SIZE >= 1MB, The memory size of "numentries" was
already integral multiple of 1MB.
Signed-off-by: Jerry Zhou <uulinux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
min_free_kbytes is updated during memory hotplug (by
init_per_zone_wmark_min) currently which is right thing to do in most
cases but this could be unexpected if admin increased the value to
prevent from allocation failures and the new min_free_kbytes would be
decreased as a result of memory hotadd.
This patch saves the user defined value and allows updating
min_free_kbytes only if it is higher than the saved one.
A warning is printed when the new value is ignored.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
the order passed. MAX_ORDER is typically 11 and pageblock_order is
typically 9 on x86. Integer division truncates, so pageblock_order / 2
is 4. For the first eight iterations, it's guaranteed that
current_order >= pageblock_order / 2 if it even gets that far!
So just remove the unlikely(), it's completely bogus.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callers of build_zonelists_node always pass MAX_NR_ZONES -1 as the
zone_type argument, so we can directly use the value in
build_zonelists_node and remove zone_type argument.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calculating pages in a node, for each zone in that node, we will
have
zone_spanned_pages_in_node
--> get_pfn_range_for_nid
zone_absent_pages_in_node
--> get_pfn_range_for_nid
That is to say, we call the get_pfn_range_for_nid to get start_pfn and
end_pfn of the node for MAX_NR_ZONES * 2 times. And this is totally
unnecessary if we call the get_pfn_range_for_nid before
zone_*_pages_in_node add two extra arguments node_start_pfn and
node_end_pfn for zone_*_pages_in_node, then we can remove the
get_pfn_range_in_node in zone_*_pages_in_node.
[akpm@linux-foundation.org: make definitions more readable]
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce helper function mem_init_print_info() to simplify mem_init()
across different architectures, which also unifies the format and
information printed.
Function mem_init_print_info() calculates memory statistics information
without walking each page, so it should be a little faster on some
architectures.
Also introduce another helper get_num_physpages() to kill the global
variable num_physpages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by https://bugzilla.kernel.org/show_bug.cgi?id=53501,
"MemTotal" from /proc/meminfo means memory pages managed by the buddy
system (managed_pages), but "MemTotal" from /sys/.../node/nodex/meminfo
means physical pages present (present_pages) within the NUMA node.
There's a difference between managed_pages and present_pages due to
bootmem allocator and reserved pages.
And Documentation/filesystems/proc.txt says
MemTotal: Total usable ram (i.e. physical ram minus a few reserved
bits and the kernel binary code)
So change /sys/.../node/nodex/meminfo to report available pages within
the node as "MemTotal".
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: <sworddragon2@aol.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enhance adjust_managed_page_count() to adjust totalhigh_pages for
highmem pages. And change code which directly adjusts totalram_pages to
use adjust_managed_page_count() because it adjusts totalram_pages,
totalhigh_pages and zone->managed_pages altogether in a safe way.
Remove inc_totalhigh_pages() and dec_totalhigh_pages() from xen/balloon
driver bacause adjust_managed_page_count() has already adjusted
totalhigh_pages.
This patch also fixes two bugs:
1) enhances virtio_balloon driver to adjust totalhigh_pages when
reserve/unreserve pages.
2) enhance memory_hotplug.c to adjust totalhigh_pages when hot-removing
memory.
We still need to deal with modifications of totalram_pages in file
arch/powerpc/platforms/pseries/cmm.c, but need help from PPC experts.
[akpm@linux-foundation.org: remove ifdef, per Wanpeng Li, virtio_balloon.c cleanup, per Sergei]
[akpm@linux-foundation.org: export adjust_managed_page_count() to modules, for drivers/virtio/virtio_balloon.c]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to simpilify management of totalram_pages and
zone->managed_pages, make __free_pages_bootmem() only available at boot
time. With this change applied, __free_pages_bootmem() will only be
used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
Other callers of __free_pages_bootmem() have been converted to use
free_reserved_page(), which handles totalram_pages and
zone->managed_pages in a safer way.
This patch also fix a bug in free_pagetable() for x86_64, which should
increase zone->managed_pages instead of zone->present_pages when freeing
reserved pages.
And now we have managed_pages_count_lock to protect totalram_pages and
zone->managed_pages, so remove the redundant ppb_lock lock in
put_page_bootmem(). This greatly simplifies the locking rules.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
protect totalram_pages and zone->managed_pages. Other than the memory
hotplug driver, totalram_pages and zone->managed_pages may also be
modified at runtime by other drivers, such as Xen balloon,
virtio_balloon etc. For those cases, memory hotplug lock is a little
too heavy, so introduce a dedicated lock to protect totalram_pages and
zone->managed_pages.
Now we have a simplified locking rules totalram_pages and
zone->managed_pages as:
1) no locking for read accesses because they are unsigned long.
2) no locking for write accesses at boot time in single-threaded context.
3) serialize write accesses at runtime by acquiring the dedicated
managed_page_count_lock.
Also adjust zone->managed_pages when freeing reserved pages into the
buddy system, to keep totalram_pages and zone->managed_pages in
consistence.
[akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
that all highmem pages will be freed into the buddy system by function
mem_init(). But that's not always true, some architectures may reserve
some highmem pages during boot. For example PPC may allocate highmem
pages for giagant HugeTLB pages, and several architectures have code to
check PageReserved flag to exclude highmem pages allocated during boot
when freeing highmem pages into the buddy system.
So treat highmem pages in the same way as normal pages, that is to:
1) reset zone->managed_pages to zero in mem_init().
2) recalculate managed_pages when freeing pages into the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Address more review comments from last round of code review.
1) Enhance free_reserved_area() to support poisoning freed memory with
pattern '0'. This could be used to get rid of poison_init_mem()
on ARM64.
2) A previous patch has disabled memory poison for initmem on s390
by mistake, so restore to the original behavior.
3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change signature of free_reserved_area() according to Russell King's
suggestion to fix following build warnings:
arch/arm/mm/init.c: In function 'mem_init':
arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
^
In file included from include/linux/mman.h:4:0,
from arch/arm/mm/init.c:15:
include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
extern unsigned long free_reserved_area(unsigned long start, unsigned long end,
mm/page_alloc.c: In function 'free_reserved_area':
>> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
In file included from arch/mips/include/asm/page.h:49:0,
from include/linux/mmzone.h:20,
from include/linux/gfp.h:4,
from include/linux/mm.h:8,
from mm/page_alloc.c:18:
arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
mm/page_alloc.c: In function 'free_area_init_nodes':
mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]
Also address some minor code review comments.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- check the length of the procfs data before copying it into a fixed
size array.
- when __parse_numa_zonelist_order() fails, save the error code for
return.
- 'char*' --> 'char *' coding style fix
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When memory hotplug is triggered, we call pageset_init() on
per-cpu-pagesets which both contain pages and are in use, causing both the
leakage of those pages and (potentially) bad behaviour if a page is
allocated from a pageset while it is being cleared.
Avoid this by factoring out pageset_set_high_and_batch() (which contains
all needed logic too set a pageset's ->high and ->batch inrespective of
system state) from zone_pageset_init() and using the new
pageset_set_high_and_batch() instead of zone_pageset_init() in
zone_pcp_update().
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
percpu pageset based on a zone's ->managed_pages. We don't need to drain
the entire percpu pageset just to modify these fields.
This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).
This does change the behavior of zone_pcp_update() as the percpu pagesets
will not be drained when zone_pcp_update() is called (they will end up
being shrunk, not completely drained, later when a 0-order page is freed
in free_hot_cold_page()).
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pcp->batch could change at any point, avoid relying on it being a stable
value.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce pageset_update() to perform a safe transision from one set of
pcp->{batch,high} to a new set using memory barriers.
This ensures that batch is always set to a safe value (1) prior to
updating high, and ensure that high is fully updated before setting the
real value of batch. It avoids ->batch ever rising above ->high.
Suggested by Gilad Ben-Yossef in these threads:
https://lkml.org/lkml/2013/4/9/23https://lkml.org/lkml/2013/4/10/49
Also reproduces his proposed comment.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because we are going to rely upon a careful transision between old and new
->high and ->batch values using memory barriers and will remove
stop_machine(), we need to prevent multiple updaters from interweaving
their memory writes.
Add a simple mutex to protect both update loops.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"Problems" with the current code:
1: there is a lack of synchronization in setting ->high and ->batch in
percpu_pagelist_fraction_sysctl_handler()
2: stop_machine() in zone_pcp_update() is unnecissary.
3: zone_pcp_update() does not consider the case where
percpu_pagelist_fraction is non-zero
To fix:
1: add memory barriers, a safe ->batch value, an update side mutex when
updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
that expect a stable value.
2: avoid draining pages in zone_pcp_update(), rely upon the memory
barriers added to fix#1
3: factor out quite a few functions, and then call the appropriate one.
Note that it results in a change to the behavior of zone_pcp_update(),
which is used by memory_hotplug. I'm rather certain that I've diserned
(and preserved) the essential behavior (changing ->high and ->batch), and
only eliminated unneeded actions (draining the per cpu pages), but this
may not be the case.
Further note that the draining of pages that previously took place in
zone_pcp_update() occured after repeated draining when attempting to
offline a page, and after the offline has "succeeded". It appears that
the draining was added to zone_pcp_update() to avoid refactoring
setup_pageset() into 2 funtions.
This patch:
Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high based
on ->batch.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The watermark check consists of two sub-checks. The first one is:
if (free_pages <= min + lowmem_reserve)
return false;
The check assures that there is minimal amount of RAM in the zone. If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.
if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
This prevents the zone from being drained from pages available for
non-movable allocations.
The second check prevents the zone from getting too fragmented.
for (o = 0; o < order; o++) {
free_pages -= z->free_area[o].nr_free << o;
min >>= 1;
if (free_pages <= min)
return false;
}
The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages. Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented. In such a case there are many 0-order
free pages located in CMA. Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation. The test fails even though there are many free non-cma
pages in the zone.
This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages <= min + lowmem_reserve) check.
Laura said:
We were observing allocation failures of higher order pages (order 5 =
128K typically) under tight memory conditions resulting in driver
failure. The output from the page allocation failure showed plenty of
free pages of the appropriate order/type/zone and mostly CMA pages in
the lower orders.
For full disclosure, we still observed some page allocation failures
even after applying the patch but the number was drastically reduced and
those failures were attributed to fragmentation/other system issues.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Tested-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org> [3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
virt_to_page() is typically implemented as a macro containing a cast so
that it will accept both pointers and unsigned long without causing a
warning.
But MIPS virt_to_page() uses virt_to_phys which is a function so passing
an unsigned long will cause a warning:
CC mm/page_alloc.o
mm/page_alloc.c: In function ‘free_reserved_area’:
mm/page_alloc.c:5161:3: warning: passing argument 1 of ‘virt_to_phys’ makes pointer from integer without a cast [enabled by default]
arch/mips/include/asm/io.h:119:100: note: expected ‘const volatile void *’ but argument is of type ‘long unsigned int’
All others users of virt_to_page() in mm/ are passing a void *.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reported-by: Eunbong Song <eunb.song@samsung.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
powerpc and x86 were opencoding copies of setup_nr_node_ids(), which
page_alloc provides but makes static. Make it avaliable to the archs in
linux/mm.h.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When booting on a large memory system, the kernel spends considerable
time in memmap_init_zone() setting up memory zones. Analysis shows
significant time spent in __early_pfn_to_nid().
The routine memmap_init_zone() checks each PFN to verify the nid is
valid. __early_pfn_to_nid() sequentially scans the list of pfn ranges
to find the right range and returns the nid. This does not scale well.
On a 4 TB (single rack) system there are 308 memory ranges to scan. The
higher the PFN the more time spent sequentially spinning through memory
ranges.
Since memmap_init_zone() increments pfn, it will almost always be
looking for the same range as the previous pfn, so check that range
first. If it is in the same range, return that nid. If not, scan the
list as before.
A 4 TB (single rack) UV1 system takes 512 seconds to get through the
zone code. This performance optimization reduces the time by 189
seconds, a 36% improvement.
A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
a 112.9 second (53%) reduction.
[akpm@linux-foundation.org: make the statics __meminitdata]
[akpm@linux-foundation.org: fix comment formatting]
[akpm@linux-foundation.org: fix ia64, per yinghai]
[akpm@linux-foundation.org: add missing semicolon, per Tony]
Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Tested-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following problem was reported against a distribution kernel when
zone_reclaim was enabled but the same problem applies to the mainline
kernel. The reproduction case was as follows
1. Run numactl -m +0 dd if=largefile of=/dev/null
This allocates a large number of clean pages in node 0
2. numactl -N +0 memhog 0.5*Mg
This start a memory-using application in node 0.
The expected behaviour is that the clean pages get reclaimed and the
application uses node 0 for its memory. The observed behaviour was that
the memory for the memhog application was allocated off-node since
commits cd38b115d5 ("mm: page allocator: initialise ZLC for first zone
eligible for zone_reclaim") and commit 76d3fbf8fb ("mm: page
allocator: reconsider zones for allocation after direct reclaim").
The assumption of those patches was that it was always preferable to
allocate quickly than stall for long periods of time and they were meant
to take care that the zone was only marked full when necessary but an
important case was missed.
In the allocator fast path, only the low watermarks are checked. If the
zones free pages are between the low and min watermark then allocations
from the allocators slow path will succeed. However, zone_reclaim will
only reclaim SWAP_CLUSTER_MAX or 1<<order pages. There is no guarantee
that this will meet the low watermark causing the zone to be marked full
prematurely.
This patch will only mark the zone full after zone_reclaim if it the min
watermarks are checked or if page reclaim failed to make sufficient
progress.
[mhocko@suse.cz: fix alloc_flags test]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Hedi Berriche <hedi@sgi.com>
Tested-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Particularly in oom conditions, it's troublesome that hugetlb memory is
not displayed. All other meminfo that is emitted will not add up to
what is expected, and there is no artifact left in the kernel log to
show that a potentially significant amount of memory is actually
allocated as hugepages which are not available to be reclaimed.
Booting with hugepages=8192 on the command line, this memory is now
shown in oom conditions. For example, with echo m >
/proc/sysrq-trigger:
Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initializion.
This is the second part, which applies to the previous part at:
http://marc.info/?l=linux-mm&m=136289696323825&w=2
It introduces a helper function free_highmem_page() to free highmem
pages into the buddy system when initializing mm subsystem.
Introduction of free_highmem_page() is one step forward to clean up
accesses and modificaitons of totalhigh_pages, totalram_pages and
zone->managed_pages etc. I hope we could remove all references to
totalhigh_pages from the arch/ subdirectory.
We have only tested these patchset on x86 platforms, and have done basic
compliation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help
to test this patchset are welcomed!
There are several other parts still under development:
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.
This patch:
Introduce helper function free_highmem_page(), which will be used by
architectures with HIGHMEM enabled to free highmem pages into the buddy
system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Attilio Rao <attilio.rao@citrix.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cong Wang <amwang@redhat.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initializion.
This is the first part, which applies to v3.9-rc1.
It introduces following common helper functions to simplify
free_initmem() and free_initrd_mem() on different architectures:
adjust_managed_page_count():
will be used to adjust totalram_pages, totalhigh_pages,
zone->managed_pages when reserving/unresering a page.
__free_reserved_page():
free a reserved page into the buddy system without adjusting
page statistics info
free_reserved_page():
free a reserved page into the buddy system and adjust page
statistics info
mark_page_reserved():
mark a page as reserved and adjust page statistics info
free_reserved_area():
free a continous ranges of pages by calling free_reserved_page()
free_initmem_default():
default method to free __init pages.
We have only tested these patchset on x86 platforms, and have done basic
compliation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help to
test this patchset are welcomed!
There are several other parts still under development:
Part2: introduce free_highmem_page() to simplify freeing highmem pages
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.
This patch:
Code to deal with reserved/managed pages are duplicated by many
architectures, so introduce common help functions to reduce duplicated
code. These common help functions will also be used to concentrate code
to modify totalram_pages and zone->managed_pages, which makes the code
much more clear.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On large systems with a lot of memory, walking all RAM to determine page
types may take a half second or even more.
In non-blockable contexts, the page allocator will emit a page allocation
failure warning unless __GFP_NOWARN is specified. In such contexts, irqs
are typically disabled and such a lengthy delay may even result in NMI
watchdog timeouts.
To fix this, suppress the page walk in such contexts when printing the
page allocation failure warning.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This symbol will be used in the Hyper-V balloon driver to support 2M
allocations.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tim found:
WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5
Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")
It turns out movable_map has some problems, and it breaks several things
1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy.
and make fall back path working.
2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that....
c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.
3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes critical x86 code. It caused x86 guys did not
pay attention to find the problem early. Those patches really should
be routed via tip/x86/mm.
4. after that commit, following range can not use movable ram:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put page table high anymore.
e. initmem_init: vmemmap can not be high local node anymore. That is
not good.
If node is hotplugable, the mem related range like page table and
vmemmap could be on the that node without problem and should be on that
node.
We have workaround patch that could fix some problems, but some can not
be fixed.
So just remove that offending commit and related ones including:
f7210e6c4a ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")
01a178a94e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")
27168d38fa ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")
e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")
fb06bc8e5f ("page_alloc: bootmem limit with movablecore_map")
42f47e27e7 ("page_alloc: make movablemem_map have higher priority")
6981ec3114 ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")
34b71f1e04 ("page_alloc: add movable_memmap kernel parameter")
4d59a75125 ("x86: get pg_data_t's memory from other node")
Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0. Also
need to find way to put other kernel used ram to local node ram.
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lockdep, but it's a mechanical change.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf
4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf
3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6
eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee
ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn
4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC
Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1
AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO
32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv
lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO
rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK
+GszxsFkCjlW0mK0egTb
=tiSY
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"The sweeping change is to make add_taint() explicitly indicate whether
to disable lockdep, but it's a mechanical change."
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Add option to not sign modules during modules_install
MODSIGN: Add -s <signature> option to sign-file
MODSIGN: Specify the hash algorithm on sign-file command line
MODSIGN: Simplify Makefile with a Kconfig helper
module: clean up load_module a little more.
modpost: Ignore ARC specific non-alloc sections
module: constify within_module_*
taint: add explicit flag to show whether lock dep is still OK.
module: printk message when module signature fail taints kernel.
nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
are horribly badly named, so accurately document them with code comments
in case of the misuse of them.
[akpm@linux-foundation.org: tweak comments]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the amount of RAM that functions nr_free_*_pages return is
held in unsigned int. But in machines with big memory (exceeding 16TB),
the amount may be incorrect because of overflow, so fix it.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make a sweep through mm/ and convert code that uses -1 directly to using
the more appropriate NUMA_NO_NODE.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a debug message which prints when a page is found outside of the
boundaries of the zone it should belong to. Format is:
"page $pfn outside zone [ $start_pfn - $end_pfn ]"
[akpm@linux-foundation.org: s/pr_debug/pr_err/]
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Freeing pages to uninitialized zones is not handled by
__free_one_page(), and should never happen when the code is correct.
Ran into this while writing some code that dynamically onlines extra
zones.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.
This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.
Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
because I expect at some point the sycronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or
clever memory barrier usage.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern. As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset(). There are others like init_page_count() but as
it is used throughout the arch code a rename would likely cause more
conflicts than it is worth.
[akpm@linux-foundation.org: fix zcache]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.
The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:
- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one of its
ancestors(or the block device itself), the deadlock may be triggered
inside the memory allocation since it might not complete until the block
device becomes active and the involed page I/O finishes. The situation
is pointed out first by Alan Stern. It is not a good approach to
convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
subsystems may be involved(for example, PCI, USB and SCSI may be
involved for usb mass stoarage device, network devices involved too in
the iSCSI case)
- during block device runtime suspend, because runtime resume need to
wait for completion of concurrent runtime suspend.
- during error handling of usb mass storage deivce, USB bus reset will
be put on the device, so there shouldn't have any memory allocation with
GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
above may be triggered. Unfortunately, any usb device may include one
mass storage interface in theory, so it requires all usb interface
drivers to handle the situation. In fact, most usb drivers don't know
how to handle bus reset on the device and don't provide .pre_set() and
.post_reset() callback at all, so USB core has to unbind and bind driver
for these devices. So it is still not practical to resort to GFP_NOIO
for solving the problem.
Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.
It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because these functions doing that may be implemented
as library and will be called in many other contexts.
In fact, memalloc_noio_flags() can convert some of current static
GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
at least almost all GFP_NOIO in USB subsystem can be converted into
GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.
[1], several GFP_KERNEL allocation examples in runtime resume path
- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume
- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
That is just what I have found. Unfortunately, this allocation can only
be found by human being now, and there should be many not found since
any function in the resume path(call tree) may allocate memory with
GFP_KERNEL.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions test MIGRATE_ISOLATE and some of those are hotpath but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
CMA, memory-hotplug and memory-failure) which are not common config
option. So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:
present_pages = spanned_pages - absent_pages;
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we have zone->managed_pages for "pages managed by the buddy system
in the zone", so replace zone->present_pages with zone->managed_pages if
what the user really wants is number of allocatable pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We now provide an option for users who don't want to specify physical
memory address in kernel commandline.
/*
* For movablemem_map=acpi:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* hotpluggable: n y y n
* movablemem_map: |_____| |_________|
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*/
So user just specify movablemem_map=acpi, and the kernel will use
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.
If all the memory ranges in SRAT is hotpluggable, then no memory can be
used by kernel. But before parsing SRAT, memblock has already reserve
some memory ranges for other purposes, such as for kernel image, and so
on. We cannot prevent kernel from using these memory. So we need to
exclude these ranges even if these memory is hotpluggable.
Furthermore, there could be several memory ranges in the single node
which the kernel resides in. We may skip one range that have memory
reserved by memblock, but if the rest of memory is too small, then the
kernel will fail to boot. So, make the whole node which the kernel
resides in un-hotpluggable. Then the kernel has enough memory to use.
NOTE: Using this way will cause NUMA performance down because the
whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
on it. If users don't want to lose NUMA performance, just don't use
it.
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: use strcmp()]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When implementing movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as
ZONE_MOVABLE.
Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
the whole node memory range, we need to extend it to the node end so
that we can use it to prevent memblock from allocating memory in the
ranges user didn't specify.
We now implement movablemem_map boot option like this:
/*
* For movablemem_map=nn[KMG]@ss[KMG]:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* user specified: |__| |___|
* movablemem_map: |___| |_________| |______| ......
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*
* NOTE: In this case, SRAT info will be ingored.
*/
[akpm@linux-foundation.org: clean up code, fix build warning]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kernelcore or movablecore is specified at the same time with
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
limit from movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() will find out to which node the ranges in
movable_map.map[] belongs, and calculates the low boundary of
ZONE_MOVABLE for each node.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add functions to parse movablemem_map boot option. Since the option
could be specified more then once, all the maps will be stored in the
global variable movablemem_map.map array.
And also, we keep the array in monotonic increasing order by start_pfn.
And merge all overlapped ranges.
[akpm@linux-foundation.org: improve comment]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: remove unneeded parens]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`int' is an inappropriate type for a number-of-pages counter.
While we're there, use the clamp() macro.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.
[akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler changes from Ingo Molnar:
"Main changes:
- scheduler side full-dynticks (user-space execution is undisturbed
and receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready, from Frederic
Weisbecker.
- Initial sched.h split-up changes, by Clark Williams
- select_idle_sibling() performance improvement by Mike Galbraith:
" 1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs "
- sched_rr_get_interval() ABI fix/change. We think this detail is not
used by apps (so it's not an ABI in practice), but lets keep it
under observation.
- misc RT scheduling cleanups, optimizations"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
cputime: Remove irqsave from seqlock readers
sched, powerpc: Fix sched.h split-up build failure
cputime: Restore CPU_ACCOUNTING config defaults for PPC64
sched/rt: Move rt specific bits into new header file
sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
sched: Move sched.h sysctl bits into separate header
sched: Fix signedness bug in yield_to()
sched: Fix select_idle_sibling() bouncing cow syndrome
sched/rt: Further simplify pick_rt_task()
sched/rt: Do not account zero delta_exec in update_curr_rt()
cputime: Safely read cputime of full dynticks CPUs
kvm: Prepare to add generic guest entry/exit callbacks
cputime: Use accessors to read task cputime stats
cputime: Allow dynamic switch between tick/virtual based cputime accounting
cputime: Generic on-demand virtual cputime accounting
cputime: Move default nsecs_to_cputime() to jiffies based cputime file
cputime: Librarize per nsecs resolution cputime definitions
cputime: Avoid multiplication overflow on utime scaling
context_tracking: Export context state for generic vtime
...
Fix up conflict in kernel/context_tracking.c due to comment additions.
Commit c060f943d0 ("mm: use aligned zone start for pfn_to_bitidx
calculation") fixed out calculation of the index into the pageblock
bitmap when a !SPARSEMEM zome was not aligned to pageblock_nr_pages.
However, the _allocation_ of that bitmap had never taken this alignment
requirement into accout, so depending on the exact size and alignment of
the zone, the use of that index could then access past the allocation,
resulting in some very subtle memory corruption.
This was reported (and bisected) by Ingo Molnar: one of his random
config builds would hang with certain very specific kernel command line
options.
In the meantime, commit c060f943d0 has been marked for stable, so this
fix needs to be back-ported to the stable kernels that backported the
commit to use the right alignment.
Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The total number of low memory pages is determined as totalram_pages -
totalhigh_pages, so without this patch all CMA pageblocks placed in
highmem were accounted to low memory.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix up all callers as they were before, with make one change: an
unsigned module taints the kernel, but doesn't turn off lockdep.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket. It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e ("mm: compaction: capture a suitable high-order page
immediately when it is made available").
The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed. For Eric, the problem was
that page->pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped. However, I identified a few more problems with the patch
including the fact that it can increase contention on zone->lock in some
cases which could result in async direct compaction being aborted early.
In retrospect the capture patch took the wrong approach. What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way. While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested. This patch
partially reverts commit 1fb3f8ca0e ("mm: compaction: capture a
suitable high-order page immediately when it is made available").
Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current calculation in pfn_to_bitidx assumes that (pfn -
zone->zone_start_pfn) >> pageblock_order will return the same bit for
all pfn in a pageblock. If zone_start_pfn is not aligned to
pageblock_nr_pages, this may not always be correct.
Consider the following with pageblock order = 10, zone start 2MB:
pfn | pfn - zone start | (pfn - zone start) >> page block order
----------------------------------------------------------------
0x26000 | 0x25e00 | 0x97
0x26100 | 0x25f00 | 0x97
0x26200 | 0x26000 | 0x98
0x26300 | 0x26100 | 0x98
This means that calling {get,set}_pageblock_migratetype on a single page
will not set the migratetype for the full block. Fix this by rounding
down zone_start_pfn when doing the bitidx calculation.
For our use case, the effects of this bug were mostly tied to the fact
that CMA allocations would either take a long time or fail to happen.
Depending on the driver using CMA, this could result in anything from
visual glitches to application failures.
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 702d1a6e07 ("memory-hotplug: fix kswapd looping forever
problem") added an isolated pageblocks counter (nr_pageblock_isolate in
struct zone) and used it to adjust free pages counter in
zone_watermark_ok_safe() to prevent kswapd looping forever problem.
Then later, commit 2139cbe627 ("cma: fix counting of isolated pages")
fixed accounting of isolated pages in global free pages counter. It
made the previous zone_watermark_ok_safe() fix unnecessary and
potentially harmful (cause now isolated pages may be accounted twice
making free pages counter incorrect).
This patch removes the special isolated pageblocks counter altogether
which fixes zone_watermark_ok_safe() free pages check.
Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory returned to free_contig_range() must have no other references.
Let kernel to complain loudly if page reference count is not equal to 1.
[rientjes@google.com: support sparsemem]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.
To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no reason,
whoever is marking its allocations with __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While allocating pages using buddy allocator, the compound page is
probably split up to free pages. Under these circumstances, the compound
page should be destroyed by destroy_compound_page(). However, there is a
duplicate check to judge if the page is compound.
Remove the duplicate check since the compound_order() returns 0 when the
page doesn't have PG_head set in destroy_compound_page(). That is to say,
destroy_compound_page() needn't check PageHead().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
Ka0JKgnWvsa6ez6FSzKI
=ivQa
-----END PGP SIGNATURE-----
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent set of comparisons available from different people are
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests shows that balancenuma is
incapable of converging for this workloads driven by perf which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and sometimes missed in treports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch
handles PTEs but I'm no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
Merge misc VM changes from Andrew Morton:
"The rest of most-of-MM. The other MM bits await a slab merge.
This patch includes the addition of a huge zero_page. Not a
performance boost but it an save large amounts of physical memory in
some situations.
Also a bunch of Fujitsu engineers are working on memory hotplug.
Which, as it turns out, was badly broken. About half of their patches
are included here; the remainder are 3.8 material."
However, this merge disables CONFIG_MOVABLE_NODE, which was totally
broken. We don't add new features with "default y", nor do we add
Kconfig questions that are incomprehensible to most people without any
help text. Does the feature even make sense without compaction or
memory hotplug?
* akpm: (54 commits)
mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
mm/memory.c: remove unused code from do_wp_page()
asm-generic, mm: pgtable: consolidate zero page helpers
mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
hwpoison, hugetlbfs: fix RSS-counter warning
hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
mm: protect against concurrent vma expansion
memcg: do not check for mm in __mem_cgroup_count_vm_event
tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
mm: provide more accurate estimation of pages occupied by memmap
fs/buffer.c: remove redundant initialization in alloc_page_buffers()
fs/buffer.c: do not inline exported function
writeback: fix a typo in comment
mm: introduce new field "managed_pages" to struct zone
mm, oom: remove statically defined arch functions of same name
mm, oom: remove redundant sleep in pagefault oom handler
mm, oom: cleanup pagefault oom handler
memory_hotplug: allow online/offline memory to result movable node
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
mm, memcg: avoid unnecessary function call when memcg is disabled
...
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
If SPARSEMEM is enabled, it won't build page structures for non-existing
pages (holes) within a zone, so provide a more accurate estimation of
pages occupied by memmap if there are bigger holes within the zone.
And pages for highmem zones' memmap will be allocated from lowmem, so
charge nr_kernel_pages for that.
[akpm@linux-foundation.org: mark calc_memmap_size __paging_init]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Tested-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a zone's present_pages is calcuated as below, which is
inaccurate and may cause trouble to memory hotplug.
spanned_pages - absent_pages - memmap_pages - dma_reserve.
During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages may
have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.
For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866https://patchwork.kernel.org/patch/1346751/
This patchset tries to introduce a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system". And revert
zone->present_pages to count "physical pages existing in a zone", which
also keep in consistence with pgdat->node_present_pages.
We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.
For DMA/normal zones, the initial value is set to:
(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when the
bootmem allocator frees all free pages to the buddy system in function
free_all_bootmem_node() and free_all_bootmem().
The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.
This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.
[akpm@linux-foundation.org: small comment tweaks]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need a node which only contains movable memory. This feature is very
important for node hotplug. If a node has normal/highmem, the memory may
be used by the kernel and can't be offlined. If the node only contains
movable memory, we can offline the memory and the node.
All are prepared, we can actually introduce N_MEMORY.
add CONFIG_MOVABLE_NODE make we can use it for movable-dedicated node
[akpm@linux-foundation.org: fix Kconfig text]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Since we introduced N_MEMORY, we update the initialization of node_states.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_contig_migrate_range() should use all possible ways to get all the
pages migrated from the given memory range, so pruning per-cpu lru lists
for all CPUs is required, regadless the cost of such operation. Otherwise
some pages which got stuck at per-cpu lru list might get missed by
migration procedure causing the contiguous allocation to fail.
Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits 2139cbe627 ("cma: fix counting of isolated pages") and
d95ea5d18e ("cma: fix watermark checking") introduced a reliable
method of free page accounting when memory is being allocated from CMA
regions, so the workaround introduced earlier by commit 49f223a9cd
("mm: trigger page reclaim in alloc_contig_range() to stabilise
watermarks") can be finally removed.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2139cbe627 ("cma: fix counting of isolated pages") free
pages in isolated pageblocks are not accounted to NR_FREE_PAGES counters,
so watermarks check is not required if one operates on a free page in
isolated pageblock.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on balloon page list as if a ballooned page was like a LRU page.
As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use __free_page() to put a page to buddy system when onlining pages.
__free_page() will store NR_FREE_PAGES in zone's pcp.vm_stat_diff, so we
should allocate zone's pcp before onlining pages, otherwise we will lose
some free pages.
[mhocko@suse.cz: make zone_pcp_reset independent of MEMORY_HOTREMOVE]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_FREE_PAGES will be wrong after offlining pages. We add/dec
NR_FREE_PAGES like this now:
1. move all pages in buddy system to MIGRATE_ISOLATE, and dec NR_FREE_PAGES
2. don't add NR_FREE_PAGES when it is freed and the migratetype is
MIGRATE_ISOLATE
3. dec NR_FREE_PAGES when offlining isolated pages.
4. add NR_FREE_PAGES when undoing isolate pages.
When we come to step 3, all pages are in MIGRATE_ISOLATE list, and
NR_FREE_PAGES are right. When we come to step4, all pages are not in
buddy system, so we don't change NR_FREE_PAGES in this step, but we change
NR_FREE_PAGES in step3. So NR_FREE_PAGES is wrong after offlining pages.
So there is no need to change NR_FREE_PAGES in step3.
This patch also fixs a problem in step2: if the migratetype is
MIGRATE_ISOLATE, we should not add NR_FRR_PAGES when we remove pages from
pcppages.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jianguo Wu <wujianguo106@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hwpoisoned may be set when we offline a page by the sysfs interface
/sys/devices/system/memory/soft_offline_page or
/sys/devices/system/memory/hard_offline_page. If we don't clear
this flag when onlining pages, this page can't be freed, and will
not in free list. So we can't offline these pages again. So we
should skip such page when offlining pages.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need custom NUMA_BUILD anymore, since we have handy
IS_ENABLED().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is useful to diagnose the reason for page allocation failure for
cases where there appear to be several free pages.
Example, with this alloc_pages(GFP_ATOMIC) failure:
swapper/0: page allocation failure: order:0, mode:0x0
...
Mem-info:
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 48
CPU 1: hi: 90, btch: 15 usd: 21
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:84 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:4026 slab_reclaimable:75 slab_unreclaimable:484
mapped:0 shmem:0 pagetables:0 bounce:0
Normal free:16104kB min:2296kB low:2868kB high:3444kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:336kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:331776kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:300kB
slab_unreclaimable:1936kB kernel_stack:328kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Before the patch, it's hard (for me, at least) to say why all these free
chunks weren't considered for allocation:
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB
1*1024kB 1*2048kB 3*4096kB = 16128kB
After the patch, it's obvious that the reason is that all of these are
in the MIGRATE_CMA (C) freelist:
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB
(C) 1*1024kB (C) 1*2048kB (C) 3*4096kB (C) = 16128kB
Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.
Signed-off-by: Mel Gorman <mgorman@suse.de>
This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
This reverts commits a50915394f and
d7c3b937bd.
This is a revert of a revert of a revert. In addition, it reverts the
even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
original commits in linux-next.
It turns out that the original patch really was bogus, and that the
original revert was the correct thing to do after all. We thought we
had fixed the problem, and then reverted the revert, but the problem
really is fundamental: waking up kswapd simply isn't the right thing to
do, and direct reclaim sometimes simply _is_ the right thing to do.
When certain allocations fail, we simply should try some direct reclaim,
and if that fails, fail the allocation. That's the right thing to do
for THP allocations, which can easily fail, and the GPU allocations want
to do that too.
So starting kswapd is sometimes simply wrong, and removing the flag that
said "don't start kswapd" was a mistake. Let's hope we never revisit
this mistake again - and certainly not this many times ;)
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 782fd30406.
We are going to reinstate the __GFP_NO_KSWAPD flag that has been
removed, the removal reverted, and then removed again. Making this
commit a pointless fixup for a problem that was caused by the removal of
__GFP_NO_KSWAPD flag.
The thing is, we really don't want to wake up kswapd for THP allocations
(because they fail quite commonly under any kind of memory pressure,
including when there is tons of memory free), and these patches were
just trying to fix up the underlying bug: the original removal of
__GFP_NO_KSWAPD in commit c654345924 ("mm: remove __GFP_NO_KSWAPD")
was simply bogus.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I've legally changed my name with New York State, the US Social Security
Administration, et al. This patch propagates the name change and change
in initials and login to comments in the kernel source as well.
Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following
Hmm, so it's just took longer to hit the problem and observe
kswapd0 spinning on my CPU again - it's not as endless like before -
but still it easily eats minutes - it helps to turn off Firefox
or TB (memory hungry apps) so kswapd0 stops soon - and restart
those apps again. (And I still have like >1GB of cached memory)
kswapd0 R running task 0 30 2 0x00000000
Call Trace:
preempt_schedule+0x42/0x60
_raw_spin_unlock+0x55/0x60
put_super+0x31/0x40
drop_super+0x22/0x30
prune_super+0x149/0x1b0
shrink_slab+0xba/0x510
The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be
reclaimed.
The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.
If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
are a storm of THP requests that are simply rejected, it will still be
the the case that kswapd is awake for a prolonged period of time as
pgdat->kswapd_max_order is updated each time. This is noticed by the
main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead
it will loopp, shrinking a small number of pages and calling
shrink_slab() on each iteration.
This patch defers when kswapd gets woken up for THP allocations. For
!THP allocations, kswapd is always woken up. For THP allocations,
kswapd is woken up iff the process is willing to enter into direct
reclaim/compaction.
[akpm@linux-foundation.org: fix typo in comment]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It apepars that this patch was innocent, and we hope that "mm: avoid
waking kswapd for THP allocations when compaction is deferred or
contended" will fix the final kswapd-spinning cause.
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ef6c5be658 ("fix incorrect NR_FREE_PAGES accounting (appears
like memory leak)") fixes a NR_FREE_PAGE accounting leak but missed the
return value which was also missed by this reviewer until today.
That return value is used by compaction when adding pages to a list of
isolated free pages and without this follow-up fix, there is a risk of
free list corruption.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following
Hmm, so it's just took longer to hit the problem and observe
kswapd0 spinning on my CPU again - it's not as endless like before -
but still it easily eats minutes - it helps to turn off Firefox
or TB (memory hungry apps) so kswapd0 stops soon - and restart
those apps again. (And I still have like >1GB of cached memory)
kswapd0 R running task 0 30 2 0x00000000
Call Trace:
preempt_schedule+0x42/0x60
_raw_spin_unlock+0x55/0x60
put_super+0x31/0x40
drop_super+0x22/0x30
prune_super+0x149/0x1b0
shrink_slab+0xba/0x510
The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be
reclaimed.
The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.
If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
are a storm of THP requests that are simply rejected, it will still be
the the case that kswapd is awake for a prolonged period of time as
pgdat->kswapd_max_order is updated each time. This is noticed by the
main kswapd() loop and it will not call kswapd_try_to_sleep(). Instead
it will loopp, shrinking a small number of pages and calling
shrink_slab() on each iteration.
The temptation is to supply a patch that checks if kswapd was woken for
THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
backed up by proper testing. As 3.7 is very close to release and this
is not a bug we should release with, a safer path is to revert "mm:
remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing
out the balance_pgdat() logic in general.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There have been some 3.7-rc reports of vm issues, including some kswapd
bugs and, more importantly, some memory "leaks":
http://www.spinics.net/lists/linux-mm/msg46187.htmlhttps://bugzilla.kernel.org/show_bug.cgi?id=50181
Commit 1fb3f8ca0e ("mm: compaction: capture a suitable high-order page
immediately when it is made available") took split_free_page() and
reused it for the compaction code. It does something curious with
capture_free_page() (previously known as split_free_page()):
int capture_free_page(struct page *page, int alloc_order,
...
__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
- /* Split into individual pages */
- set_page_refcounted(page);
- split_page(page, order);
+ if (alloc_order != order)
+ expand(zone, page, alloc_order, order,
+ &zone->free_area[order], migratetype);
Note that expand() puts the pages _back_ in the allocator, but it does
not bump NR_FREE_PAGES. We "return" 'alloc_order' worth of pages, but
we accounted for removing 'order' in the __mod_zone_page_state() call.
For the old split_page()-style use (order==alloc_order) the bug will not
trigger. But, when called from the compaction code where we
occasionally get a larger page out of the buddy allocator than we need,
we will run in to this.
This patch simply changes the NR_FREE_PAGES manipulation to the correct
'alloc_order' instead of 'order'.
I've been able to repeatedly trigger this in my testing environment.
The amount "leaked" very closely tracks the imbalance I see in buddy
pages vs. NR_FREE_PAGES. I have confirmed that this patch fixes the
imbalance
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 7f1290f2f2 ("mm: fix-up zone present pages")
That patch tried to fix a issue when calculating zone->present_pages,
but it caused a regression on 32bit systems with HIGHMEM. With that
change, reset_zone_present_pages() resets all zone->present_pages to
zero, and fixup_zone_present_pages() is called to recalculate
zone->present_pages when the boot allocator frees core memory pages into
buddy allocator. Because highmem pages are not freed by bootmem
allocator, all highmem zones' present_pages becomes zero.
Various options for improving the situation are being discussed but for
now, let's return to the 3.6 code.
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec. (On
many configurations, calculating zone from page is slower.)
But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded. Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
IP: __mod_zone_page_state+0x9/0x60
Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
Call Trace:
__pagevec_lru_add_fn+0xdf/0x140
pagevec_lru_move_fn+0xb1/0x100
__pagevec_lru_add+0x1c/0x30
lru_add_drain_cpu+0xa3/0x130
lru_add_drain+0x2f/0x40
...
The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec->zone. The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.
So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.
Ah, there was one exceptionr. For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too. In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.
I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.
Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 957f822a0a ("mm, numa: reclaim from all nodes within reclaim
distance") caused zone_reclaim_mode to be set for all systems where two
nodes are within RECLAIM_DISTANCE of each other. This is the opposite
of what we actually want: zone_reclaim_mode should be set if two nodes
are sufficiently distant.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Julian Wollrath <jwollrath@web.de>
Tested-by: Julian Wollrath <jwollrath@web.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Patrik Kullman <patrik.kullman@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If start_isolate_page_range() failed, unset_migratetype_isolate() has been
done inside it.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Ni zhan Chen <nizhan.chen@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
reclaim_clean_pages_from_list() reclaims clean pages before migration so
cc.nr_migratepages should be updated. Currently, there is no problem but
it can be wrong if we try to use the value in future.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.
This patch makes mlocked pages be migrated out. Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running. If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.
[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
causing the kernel to hang. When the system doesn't have enough free
pages, it enters reclaim but never reclaim any pages due to
too_many_isolated()==true and loops forever.
The cause is that when we do memory-hotadd after memory-remove,
__zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
although the vm_stat_diff of all CPUs still have values.
In addtion, when we offline all pages of the zone, we reset them in
zone_pcp_reset without draining so we loss some zone stat item.
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.
To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greather than this distance. This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.
What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE. This patch adds a nodemask to each node that
represents the set of nodes that are within this distance. During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
checking it, reporting "BUG: Bad page state" if it's ever found set.
Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I think zone->present_pages indicates pages that buddy system can management,
it should be:
zone->present_pages = spanned pages - absent pages - bootmem pages,
but is now:
zone->present_pages = spanned pages - absent pages - memmap pages.
spanned pages: total size, including holes.
absent pages: holes.
bootmem pages: pages used in system boot, managed by bootmem allocator.
memmap pages: pages used by page structs.
This may cause zone->present_pages less than it should be. For example,
numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
present_pages should be spanned pages - absent pages, but now it also
minus memmap pages(free_area_init_core), which are actually allocated from
ZONE_MOVABLE. When offlining all memory of a zone, this will cause
zone->present_pages less than 0, because present_pages is unsigned long
type, it is actually a very large integer, it indirectly caused
zone->watermark[WMARK_MIN] becomes a large
integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a
large integer(calculate_totalreserve_pages()), and finally cause memory
allocating failure when fork process(__vm_enough_memory()).
[root@localhost ~]# dmesg
-bash: fork: Cannot allocate memory
I think the bug described in
http://marc.info/?l=linux-mm&m=134502182714186&w=2
is also caused by wrong zone present pages.
This patch intends to fix-up zone->present_pages when memory are freed to
buddy system on x86_64 and IA64 platforms.
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Tested-by: Petr Tesarik <ptesarik@suse.cz>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
it out (move + rename as a common name) into page_isolation.c.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning. This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths. Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever. Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.
Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic. There are two cases where the
cache is cleared.
1. If a !kswapd process completes a compaction cycle (migrate and free
scanner meet), the zone is marked compact_blockskip_flush. When kswapd
goes to sleep, it will clear the cache. This is expected to be the
common case where the cache is cleared. It does not really matter if
kswapd happens to be asleep or going to sleep when the flag is set as
it will be woken on the next allocation request.
2. If there have been multiple failures recently and compaction just
finished being deferred then a process will clear the cache and start a
full scan. This situation happens if there are multiple high-order
allocation requests under heavy memory pressure.
The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as THP,
they will simply fail. For requests that cannot fail, they will retry the
allocation. Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When compaction was implemented it was known that scanning could
potentially be excessive. The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line. It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.
Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.
The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch
o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
be discarded if it's at least 5 seconds since the last time the cache
was discarded
o If there are a large number of allocation failures, discard the cache.
The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure. Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page. The time-based mechanism is
clumsy but a better option is not obvious.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 7db8889ab0 ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages"). These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand. A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0ee332c145 ("memblock: Kill early_node_map[]") removed
early_node_map[]. Clean up the comments to comply with that change.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If race between allocation and isolation in memory-hotplug offline
happens, some pages could be in MIGRATE_MOVABLE of free_list although the
pageblock's migratetype of the page is MIGRATE_ISOLATE.
The race could be detected by get_freepage_migratetype in
__test_page_isolated_in_pageblock. If it is detected, now EBUSY gets
bubbled all the way up and the hotplug operations fails.
But better idea is instead of returning and failing memory-hotremove, move
the free page to the correct list at the time it is detected. It could
enhance memory-hotremove operation success ratio although the race is
really rare.
Suggested by Mel Gorman.
[akpm@linux-foundation.org: small cleanup]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator caches the pageblock information in page->private while
it is in the PCP freelists but this is overwritten with the order of the
page when freed to the buddy allocator. This patch stores the migratetype
of the page in the page->index field so that it is available at all times
when the page remain in free_list.
This patch adds a new call site in __free_pages_ok so it might be overhead
a bit but it's for high order allocation. So I believe damage isn't hurt.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator uses set_page_private and page_private for handling
migratetype when it frees page. Let's replace them with [set|get]
_freepage_migratetype to make it more clear.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
(from Minchan Kim).
* During watermark check decrease available free pages number by
free CMA pages number if necessary (unmovable allocations cannot
use pages from CMA areas).
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
__zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
counter always available (not only when CONFIG_CMA=y).
[akpm@linux-foundation.org: use conventional migratetype naming]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter. Fix
it by properly decreasing/increasing NR_FREE_PAGES counter in
set_migratetype_isolate()/unset_migratetype_isolate() and removing counter
adjustment for isolated pages from free_one_page() and split_free_page().
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page->private gets re-used in __free_one_page() to store page order
(so trace_mm_page_pcpu_drain() may print order instead of migratetype)
thus migratetype value must be cached locally.
Fixes regression introduced in commit a7016235a6 ("mm: fix migratetype
bug which slowed swapping"). This caused incorrect data to be attached
to the mm_page_pcpu_drain trace event.
[akpm@linux-foundation.org: add comment]
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop clean cache pages instead of migration during alloc_contig_range() to
minimise allocation latency by reducing the amount of migration that is
necessary. It's useful for CMA because latency of migration is more
important than evicting the background process's working set. In
addition, as pages are reclaimed then fewer free pages for migration
targets are required so it avoids memory reclaiming to get free pages,
which is a contributory factor to increased latency.
I measured elapsed time of __alloc_contig_migrate_range() which migrates
10M in 40M movable zone in QEMU machine.
Before - 146ms, After - 7ms
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rik van Riel <riel@redhat.com>
Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While compaction is migrating pages to free up large contiguous blocks
for allocation it races with other allocation requests that may steal
these blocks or break them up. This patch alters direct compaction to
capture a suitable free page as soon as it becomes available to reduce
this race. It uses similar logic to split_free_page() to ensure that
watermarks are still obeyed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When transparent huge pages were introduced, memory compaction and swap
storms were an issue, and the kernel had to be careful to not make THP
allocations cause pageout or compaction.
Now that we have working compaction deferral, kswapd is smart enough to
invoke compaction and the quadratic behaviour around isolate_free_pages
has been fixed, it should be safe to remove __GFP_NO_KSWAPD.
[minchan@kernel.org: Comment fix]
[mgorman@suse.de: Avoid direct reclaim for deferred compaction]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The heuristic method for buddy has been introduced since commit
43506fad21 ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists"). But the page address of higher page's buddy
was wrongly calculated, which will lead page_is_buddy to fail for ever.
IOW, the heuristic method would be disabled with the wrong page address
of higher page's buddy.
Calculating the page address of higher page's buddy should be based
higher_page with the offset between index of higher page and index of
higher page's buddy.
Signed-off-by: Haifeng Li <omycle@gmail.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straight-forward and in his own words;
The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive. FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.
Early in the test everything looks fine
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0
and then it goes to pot
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0
Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found
perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]
There was other data but primarily it is all showing that compaction is
contended heavily on the zone->lock and zone->lru_lock.
commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.
This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit cfd19c5a9e ("mm: only set page->pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
setting, but it missed some places the pfmemalloc should be set.
So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:
64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|---0.34%-- deactivate_slab
| __slab_alloc
| kmem_cache_alloc
| |
That causes our fio sync write performance to have a 40% regression.
Move the checking in get_page_from_freelist() which resolves this issue.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Borislav Petkov reports that the new warning added in commit
88fdf75d1b ("mm: warn if pg_data_t isn't initialized with zero")
triggers for him, and it is the node_start_pfn field that has already
been initialized once.
The call trace looks like this:
x86_64_start_kernel ->
x86_64_start_reservations ->
start_kernel ->
setup_arch ->
paging_init ->
zone_sizes_init ->
free_area_init_nodes ->
free_area_init_node
and (with the warning replaced by debug output), Borislav sees
On node 0 totalpages: 4193848
DMA zone: 64 pages used for memmap
DMA zone: 6 pages reserved
DMA zone: 3890 pages, LIFO batch:0
DMA32 zone: 16320 pages used for memmap
DMA32 zone: 798464 pages, LIFO batch:31
Normal zone: 52736 pages used for memmap
Normal zone: 3322368 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 4423680 <----
On node 1 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 8617984 <----
On node 2 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 12812288 <----
On node 3 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
so remove the bogus warning for now to avoid annoying people. Minchan
Kim is looking at it.
Reported-by: Borislav Petkov <bp@amd64.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pg_data_t is zeroed before reaching free_area_init_core(), so remove the
now unnecessary initializations.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
when it is allocated. Arch code and memory hotplug already initiailize
pg_data_t. So this warning should never happen. I select fields randomly
near the beginning, middle and end of pg_data_t for checking.
This patch isn't for performance but for removing initialization code
which is necessary to add whenever we adds new field to pg_data_t or zone.
Firstly, Andrew suggested clearing out of pg_data_t in MM core part but
Tejun doesn't like it because in the future, some archs can initialize
some fields in arch code and pass them into general MM part so blindly
clearing it out in mm core part would be very annoying.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If swap is backed by network storage such as NBD, there is a risk that a
large number of reclaimers can hang the system by consuming all
PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
min_free_kbytes in advance which is a bit fragile.
This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother but
the system will keep running.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reserve is proportionally distributed over all !highmem zones in the
system. So we need to allow an emergency allocation access to all zones.
In order to do that we need to break out of any mempolicy boundaries we
might have.
In my opinion that does not break mempolicies as those are user oriented
and not system oriented. That is, system allocations are not guaranteed
to be within mempolicy boundaries. For instance IRQs do not even have a
mempolicy.
So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user) is ok.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_slowpath() is called when the number of free pages is below
the low watermark. If the caller is entitled to use ALLOC_NO_WATERMARKS
then the page will be marked page->pfmemalloc. This protects more pages
than are strictly necessary as we only need to protect pages allocated
below the min watermark (the pfmemalloc reserves).
This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
required to allocate the page.
[rientjes@google.com: David noticed the problem during review]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.
Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. This patch borrows the task flags from whatever process happens
to be preempted by the softirq. It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.
The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process. This should be safe due to
the following reasons
Softirqs can run on multiple CPUs sure but the same task should not be
executing the same softirq code. Neither should the softirq
handler be preempted by any other softirq handler so the flags
should not leak to an unrelated softirq.
Softirqs re-enable hardware interrupts in __do_softirq() so can be
preempted by hardware interrupts so PF_MEMALLOC is inherited
by the hard IRQ. However, this is similar to a process in
reclaim being preempted by a hardirq. While PF_MEMALLOC is
set, gfp_to_alloc_flags() distinguishes between hard and
soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
flag.
If the softirq is deferred to ksoftirq then its flags may be used
instead of a normal tasks but as the softirq cannot be preempted,
the PF_MEMALLOC flag does not leak to other code by accident.
[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
like PF_MEMALLOC. It allows one to pass along the memalloc state in
object related allocation flags as opposed to task related flags, such as
sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers
using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now
enough to identify allocations related to page reclaim.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotplug offlining happens on zone A, it starts to mark freed page as
MIGRATE_ISOLATE type in buddy for preventing further allocation.
(MIGRATE_ISOLATE is very irony type because it's apparently on buddy but
we can't allocate them).
When the memory shortage happens during hotplug offlining, current task
starts to reclaim, then wake up kswapd. Kswapd checks watermark, then go
sleep because current zone_watermark_ok_safe doesn't consider
MIGRATE_ISOLATE freed page count. Current task continue to reclaim in
direct reclaim path without kswapd's helping. The problem is that
zone->all_unreclaimable is set by only kswapd so that current task would
be looping forever like below.
__alloc_pages_slowpath
restart:
wake_all_kswapd
rebalance:
__alloc_pages_direct_reclaim
do_try_to_free_pages
if global_reclaim && !all_unreclaimable
return 1; /* It means we did did_some_progress */
skip __alloc_pages_may_oom
should_alloc_retry
goto rebalance;
If we apply KOSAKI's patch[1] which doesn't depends on kswapd about
setting zone->all_unreclaimable, we can solve this problem by killing some
task in direct reclaim path. But it doesn't wake up kswapd, still. It
could be a problem still if other subsystem needs GFP_ATOMIC request. So
kswapd should consider MIGRATE_ISOLATE when it calculate free pages BEFORE
going sleep.
This patch counts the number of MIGRATE_ISOLATE page block and
zone_watermark_ok_safe will consider it if the system has such blocks
(fortunately, it's very rare so no problem in POV overhead and kswapd is
never hotpath).
Copy/modify from Mel's quote
"
Ideal solution would be "allocating" the pageblock.
It would keep the free space accounting as it is but historically,
memory hotplug didn't allocate pages because it would be difficult to
detect if a pageblock was isolated or if part of some balloon.
Allocating just full pageblocks would work around this, However,
it would play very badly with CMA.
"
[1] http://lkml.org/lkml/2012/6/14/74
[akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
[akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__zone_watermark_ok currently compares free_pages which is a signed type
with z->lowmem_reserve[classzone_idx] which is unsigned which might lead
to sign overflow if free_pages doesn't satisfy the given order (or it came
as negative already) and then we rely on the following order loop to fix
it (which doesn't work for order-0). Let's fix the type conversion and do
not rely on the given value of free_pages or follow up fixups.
This patch fixes it because "memory-hotplug: fix kswapd looping forever
problem" depends on this.
As benefit of this patch, it doesn't rely on the loop to exit
__zone_watermark_ok in case of high order check and make the first test
effective.(ie, if (free_pages <= min + lowmem_reserve))
Aaditya reported this problem when he test my hotplug patch.
Reported-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Tested-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc.c has some memory isolation functions but they are used only
when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's make
it configurable by new CONFIG_MEMORY_ISOLATION so that it can reduce
binary size and we can check it simple by CONFIG_MEMORY_ISOLATION, not if
defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark functions used by both boot and memory hotplug as __meminit to reduce
memory footprint when memory hotplug is disabled.
Alos guard zone_pcp_update() with CONFIG_MEMORY_HOTPLUG because it's only
used by memory hotplug code.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Wei Wang <Bessel.Wang@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a zone becomes empty after memory offlining, free zone->pageset.
Otherwise it will cause memory leak when adding memory to the empty zone
again because build_all_zonelists() will allocate zone->pageset for an
empty zone.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Wei Wang <Bessel.Wang@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
mutex_lock(&zonelists_mutex);
build_all_zonelists(pgdat, NULL);
mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
build_zonelists(pgdat);
build_zonelist_cache(pgdat);
}
Though we hope to create zonelist for the new pgdat, but it doesn't. So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
Itanium, pageblock_order is a variable with default value of 0. It's set
to the right value by set_pageblock_order() in function
free_area_init_core().
But pageblock_order may be used by sparse_init() before free_area_init_core()
is called along path:
sparse_init()
->sparse_early_usemaps_alloc_node()
->usemap_size()
->SECTION_BLOCKFLAGS_BITS
->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
NR_PAGEBLOCK_BITS)
The uninitialized pageblock_size will cause memory wasting because
usemap_size() returns a much bigger value then it's really needed.
For example, on an Itanium platform,
sparse_init() pageblock_order=0 usemap_size=24576
free_area_init_core() before pageblock_order=0, usemap_size=24576
free_area_init_core() after pageblock_order=12, usemap_size=8
That means 24K memory has been wasted for each section, so fix it by calling
set_pageblock_order() from sparse_init().
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.
However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.
This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.
The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially, in that
case we restart from the end of the zone and try once more.
Forced full (order == -1) compactions are left alone.
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __count_immobile_pages() naming is rather awkward. Choose a more
clear name and add a comment.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d179e84ba ("mm: vmscan: do not use page_count without a page pin") fixed
this problem in vmscan.c but same problem is in __count_immobile_pages().
I copy and paste d179e84ba's contents for description.
"It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page->first_page if the compound page is being freed by another CPU."
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds
pages to the buddy allocator again") fixed one free_pcppages_bulk()
misuse. But two another miuse still exist.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, function should_fail() has "bool" for its return value, so it's
reasonable to change the return value of function should_fail_alloc_page()
into "bool" as well.
The patch does cleanup on function should_fail_alloc_page() to have "bool"
for its return value.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The filesystem layer expects pages in the block device's mapping to not
be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
currently replace lowmem pages with highmem pages, leading to crashes in
filesystem code such as the one below:
Unable to handle kernel NULL pointer dereference at virtual address 00000400
pgd = c0c98000
[00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
Internal error: Oops: 817 [#1] PREEMPT SMP ARM
CPU: 0 Not tainted (3.5.0-rc5+ #80)
PC is at __memzero+0x24/0x80
...
Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
Backtrace:
[<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
[<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
r4:c15337f0
[<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
[<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
[<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
r6:beccdcf0 r5:00074000 r4:beccdbbc
[<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
Fix this by replacing only highmem pages with highmem.
Reported-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
This reverts commit 5ceb9ce6fe.
That commit seems to be the cause of the mm compation list corruption
issues that Dave Jones reported. The locking (or rather, absense
there-of) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.
So revert it for now, we can re-visit this for 3.6. If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first stage of struct mem_cgroup_zone removal. Further
patches replace struct mem_cgroup_zone with a pointer to struct lruvec.
If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With mem_cgroup_disabled() now explicit, it becomes clear that the
zone_reclaim_stat structure actually belongs in lruvec, per-zone when
memcg is disabled but per-memcg per-zone when it's enabled.
We can delete mem_cgroup_get_reclaim_stat(), and change
update_page_reclaim_stat() to update just the one set of stats, the one
which get_scan_count() will actually use.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
String tables with names of enum items are always prone to go out of
sync with the enums themselves. Ensure during compile time that the
name table of page flags has the same size as the page flags enum.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The array pageflag_names[] does conversion from page flags into their
corresponding names so that a meaningful representation of the
corresponding page flag can be printed. This mechanism is used while
dumping page frames. However, the array missed PG_compound_lock. So
the PG_compound_lock page flag would be printed as a digital number
instead of a meaningful string.
The patch fixes that and prints "compound_lock" for the PG_compound_lock
page flag.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
allocation takes ownership of the block may take too long. The type of
the pageblock remains unchanged so the pageblock cannot be used as a
migration target during compaction.
Fix it by:
* Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
COMPACT_SYNC) and then converting sync field in struct compact_control
to use it.
* Adding nr_pageblocks_skipped field to struct compact_control and
tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
suitable page for allocation. In this case then check how if there were
enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
COMPACT_ASYNC_UNMOVABLE mode.
* Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all
pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
sets change the whole pageblock type to MIGRATE_MOVABLE.
My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means
131072 standard 4KiB pages in 'Normal' zone) is to:
- allocate 120000 pages for kernel's usage
- free every second page (60000 pages) of memory just allocated
- allocate and use 60000 pages from user space
- free remaining 60000 pages of kernel memory
(now we have fragmented memory occupied mostly by user space pages)
- try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
The results:
- with compaction disabled I get 11 successful allocations
- with compaction enabled - 14 successful allocations
- with this patch I'm able to get all 100 successful allocations
NOTE: If we can make kswapd aware of order-0 request during compaction, we
can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
(COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
following thread:
http://marc.info/?l=linux-mm&m=133552069417068&w=2
[minchan@kernel.org: minor cleanups]
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This has always been broken: one version takes an unsigned int and the
other version takes no arguments. This bug was hidden because one
version of set_pageblock_order() was a macro which doesn't evaluate its
argument.
Simplify it all and remove pageblock_default_order() altogether.
Reported-by: rajman mekaco <rajman.mekaco@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print physical address info in a style consistent with the %pR style used
elsewhere in the kernel. For example:
-Zone PFN ranges:
+Zone ranges:
- DMA32 0x00000010 -> 0x00100000
+ DMA32 [mem 0x00010000-0xffffffff]
- Normal 0x00100000 -> 0x01080000
+ Normal [mem 0x100000000-0x107fffffff]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
"These patches contain two major updates for DMA mapping subsystem
(mainly for ARM architecture). First one is Contiguous Memory
Allocator (CMA) which makes it possible for device drivers to allocate
big contiguous chunks of memory after the system has booted.
The main difference from the similar frameworks is the fact that CMA
allows to transparently reuse the memory region reserved for the big
chunk allocation as a system memory, so no memory is wasted when no
big chunk is allocated. Once the alloc request is issued, the
framework migrates system pages to create space for the required big
chunk of physically contiguous memory.
For more information one can refer to nice LWN articles:
- 'A reworked contiguous memory allocator':
http://lwn.net/Articles/447405/
- 'CMA and ARM':
http://lwn.net/Articles/450286/
- 'A deep dive into CMA':
http://lwn.net/Articles/486301/
- and the following thread with the patches and links to all previous
versions:
https://lkml.org/lkml/2012/4/3/204
The main client for this new framework is ARM DMA-mapping subsystem.
The second part provides a complete redesign in ARM DMA-mapping
subsystem. The core implementation has been changed to use common
struct dma_map_ops based infrastructure with the recent updates for
new dma attributes merged in v3.4-rc2. This allows to use more than
one implementation of dma-mapping calls and change/select them on the
struct device basis. The first client of this new infractructure is
dmabounce implementation which has been completely cut out of the
core, common code.
The last patch of this redesign update introduces a new, experimental
implementation of dma-mapping calls on top of generic IOMMU framework.
This lets ARM sub-platform to transparently use IOMMU for DMA-mapping
calls if one provides required IOMMU hardware.
For more information please refer to the following thread:
http://www.spinics.net/lists/arm-kernel/msg175729.html
The last patch merges changes from both updates and provides a
resolution for the conflicts which cannot be avoided when patches have
been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."
Acked by Andrew Morton <akpm@linux-foundation.org>:
"Yup, this one please. It's had much work, plenty of review and I
think even Russell is happy with it."
* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
ARM: dma-mapping: use PMD size for section unmap
cma: fix migration mode
ARM: integrate CMA with DMA-mapping subsystem
X86: integrate CMA with DMA-mapping subsystem
drivers: add Contiguous Memory Allocator
mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
mm: extract reclaim code from __alloc_pages_direct_reclaim()
mm: Serialize access to min_free_kbytes
mm: page_isolation: MIGRATE_CMA isolation functions added
mm: mmzone: MIGRATE_CMA migration type added
mm: page_alloc: change fallbacks array handling
mm: page_alloc: introduce alloc_contig_range()
mm: compaction: export some of the functions
mm: compaction: introduce isolate_freepages_range()
mm: compaction: introduce map_pages()
mm: compaction: introduce isolate_migratepages_range()
mm: page_alloc: remove trailing whitespace
ARM: dma-mapping: add support for IOMMU mapper
ARM: dma-mapping: use alloc, mmap, free from dma_ops
ARM: dma-mapping: remove redundant code and do the cleanup
...
Conflicts:
arch/x86/include/asm/dma-mapping.h
Pull more networking updates from David Miller:
"Ok, everything from here on out will be bug fixes."
1) One final sync of wireless and bluetooth stuff from John Linville.
These changes have all been in his tree for more than a week, and
therefore have had the necessary -next exposure. John was just away
on a trip and didn't have a change to send the pull request until a
day or two ago.
2) Put back some defines in user exposed header file areas that were
removed during the tokenring purge. From Stephen Hemminger and Paul
Gortmaker.
3) A bug fix for UDP hash table allocation got lost in the pile due to
one of those "you got it.. no I've got it.." situations. :-)
From Tim Bird.
4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
try to coalesce overlapping frags and crash. Fix from Eric Dumazet.
5) RCU routing table lookups can race with free_fib_info(), causing
crashes when we deref the device pointers in the route. Fix by
releasing the net device in the RCU callback. From Yanmin Zhang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
tcp: take care of overlaps in tcp_try_coalesce()
ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
mm: add a low limit to alloc_large_system_hash
ipx: restore token ring define to include/linux/ipx.h
if: restore token ring ARP type to header
xen: do not disable netfront in dom0
phy/micrel: Fix ID of KSZ9021
mISDN: Add X-Tensions USB ISDN TA XC-525
gianfar:don't add FCB length to hard_header_len
Bluetooth: Report proper error number in disconnection
Bluetooth: Create flags for bt_sk()
Bluetooth: report the right security level in getsockopt
Bluetooth: Lock the L2CAP channel when sending
Bluetooth: Restore locking semantics when looking up L2CAP channels
Bluetooth: Fix a redundant and problematic incoming MTU check
Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
Bluetooth: Fix EIR data generation for mgmt_device_found
Bluetooth: Fix Inquiry with RSSI event mask
Bluetooth: improve readability of l2cap_seq_list code
Bluetooth: Fix skb length calculation
...
UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.
On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.
As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.
This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.
We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.
Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's the driver core, and other driver subsystems, pull request for
the 3.5-rc1 merge window.
Outside of a few minor driver core changes, we ended up with the
following different subsystem and core changes as well, due to
interdependancies on the driver core:
- hyperv driver updates
- drivers/memory being created and some drivers moved into it
- extcon driver subsystem created out of the old Android staging switch
driver code
- dynamic debug updates
- printk rework, and /dev/kmsg changes
All of this has been tested in the linux-next releases for a few weeks
with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iEYEABECAAYFAk+7q28ACgkQMUfUDdst+ykXmwCfcPASzC+/bDkuqdWsqzxlWZ7+
VOQAnAriySv397St36J6Hz5bMQZwB1Yq
=SQc+
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg Kroah-Hartman:
"Here's the driver core, and other driver subsystems, pull request for
the 3.5-rc1 merge window.
Outside of a few minor driver core changes, we ended up with the
following different subsystem and core changes as well, due to
interdependancies on the driver core:
- hyperv driver updates
- drivers/memory being created and some drivers moved into it
- extcon driver subsystem created out of the old Android staging
switch driver code
- dynamic debug updates
- printk rework, and /dev/kmsg changes
All of this has been tested in the linux-next releases for a few weeks
with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed
that a patch to the deleted drivers/misc/max8997-muic.c driver needs to
be applied to this one.
* tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits)
uio_pdrv_genirq: get irq through platform resource if not set otherwise
memory: tegra{20,30}-mc: Remove empty *_remove()
printk() - isolate KERN_CONT users from ordinary complete lines
sysfs: get rid of some lockdep false positives
Drivers: hv: util: Properly handle version negotiations.
Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp()
memory: tegra{20,30}-mc: Use dev_err_ratelimited()
driver core: Add dev_*_ratelimited() family
Driver Core: don't oops with unregistered driver in driver_find_device()
printk() - restore prefix/timestamp printing for multi-newline strings
printk: add stub for prepend_timestamp()
ARM: tegra30: Make MC optional in Kconfig
ARM: tegra20: Make MC optional in Kconfig
ARM: tegra30: MC: Remove unnecessary BUG*()
ARM: tegra20: MC: Remove unnecessary BUG*()
printk: correctly align __log_buf
ARM: tegra30: Add Tegra Memory Controller(MC) driver
ARM: tegra20: Add Tegra Memory Controller(MC) driver
printk() - restore timestamp printing at console output
printk() - do not merge continuation lines of different threads
...
__alloc_contig_migrate_range calls migrate_pages with wrong argument
for migrate_mode. Fix it.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
alloc_contig_range() performs memory allocation so it also should keep
track on keeping the correct level of memory watermarks. This commit adds
a call to *_slowpath style reclaim to grab enough pages to make sure that
the final collection of contiguous pages from freelists will not starve
the system.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
This patch extracts common reclaim code from __alloc_pages_direct_reclaim()
function to separate function: __perform_reclaim() which can be later used
by alloc_contig_range().
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement. Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.
[a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
This commit changes various functions that change pages and
pageblocks migrate type between MIGRATE_ISOLATE and
MIGRATE_MOVABLE in such a way as to allow to work with
MIGRATE_CMA migrate type.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.
This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).
It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory. Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.
To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types when requested.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
This commit adds a row for MIGRATE_ISOLATE type to the fallbacks array
which was missing from it. It also, changes the array traversal logic
a little making MIGRATE_RESERVE an end marker. The letter change,
removes the implicit MIGRATE_UNMOVABLE from the end of each row which
was read by __rmqueue_fallback() function.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
This commit adds the alloc_contig_range() function which tries
to allocate given range of pages. It tries to migrate all
already allocated pages that fall in the range thus freeing them.
Once all pages in the range are freed they are removed from the
buddy system thus allocated for the caller to use.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
Why is there less MemFree than there used to be? It perturbed a test,
so I've just been bisecting linux-next, and now find the offender went
upstream yesterday.
Commit 93278814d3 "mm: fix division by 0 in percpu_pagelist_fraction()"
mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
us expect it to be left unset at 0 (and it's not then used as a divisor).
MemTotal: 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB
Repetitive test with percpu_pagelist_fraction 8:
MemFree: 6948420kB 6237172kB 6949696kB 6840692kB 6949048kB 6862984kB
Same test with percpu_pagelist_fraction back to 0:
MemFree: 7945000kB 7944908kB 7948568kB 7949060kB 7948796kB 7948812kB
Signed-off-by: Hugh Dickins <hughd@google.com>
[ We really should fix the crazy sysctl interface too, but that's a
separate thing - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as
a possible error from proc_dointvec_minmax().
If any other error is returned, it would proceed to divide by zero since
percpu_pagelist_fraction wasn't getting initialized at any point. For
example, writing 0 bytes into the proc file would trigger the issue.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
an IPI requesting CPUs to drain these pages to the buddy allocator if they
actually have pages when asked to flush.
This patch saves 85%+ of IPIs asking to drain per-cpu pages in case of
severe memory pressure that leads to OOM since in these cases multiple,
possibly concurrent, allocation requests end up in the direct reclaim code
path so when the per-cpu pages end up reclaimed on first allocation
failure for most of the proceeding allocation attempts until the memory
pressure is off (possibly via the OOM killer) there are no per-cpu pages
on most CPUs (and there can easily be hundreds of them).
This also has the side effect of shortening the average latency of direct
reclaim by 1 or more order of magnitude since waiting for all the CPUs to
ACK the IPI takes a long time.
Tested by running "hackbench 400" on a 8 CPU x86 VM and observing the
difference between the number of direct reclaim attempts that end up in
drain_all_pages() and those were more then 1/2 of the online CPU had any
per-cpu page in them, using the vmstat counters introduced in the next
patch in the series and using proc/interrupts.
In the test sceanrio, this was seen to save around 3600 global
IPIs after trigerring an OOM on a concurrent workload:
$ cat /proc/vmstat | tail -n 2
pcp_global_drain 0
pcp_global_ipi_saved 0
$ cat /proc/interrupts | grep CAL
CAL: 1 2 1 2
2 2 2 2 Function call interrupts
$ hackbench 400
[OOM messages snipped]
$ cat /proc/vmstat | tail -n 2
pcp_global_drain 3647
pcp_global_ipi_saved 3642
$ cat /proc/interrupts | grep CAL
CAL: 6 13 6 3
3 3 1 2 7 Function call interrupts
Please note that if the global drain is removed from the direct reclaim
path as a patch from Mel Gorman currently suggests this should be replaced
with an on_each_cpu_cond invocation.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size of coredump files is limited by RLIMIT_CORE, however, allocating
large amounts of memory results in three negative consequences:
- the coredumping process may be chosen for oom kill and quickly deplete
all memory reserves in oom conditions preventing further progress from
being made or tasks from exiting,
- the coredumping process may cause other processes to be oom killed
without fault of their own as the result of a SIGSEGV, for example, in
the coredumping process, or
- the coredumping process may result in a livelock while writing to the
dump file if it needs memory to allocate while other threads are in
the exit path waiting on the coredumper to complete.
This is fixed by implying __GFP_NORETRY in the page allocator for
coredumping processes when reclaim has failed so the allocations fail and
the process continues to exit.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_zone_movable_pfns_for_nodes() does not use its argument.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add_from_early_node_map() is unused.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This cpu hotplug hook was accidentally removed in commit 00a62ce91e
("mm: fix Committed_AS underflow on large NR_CPUS environment")
The visible effect of this accident: some pages are borrowed in per-cpu
page-vectors. Truncate can deal with it, but these pages cannot be
reused while this cpu is offline. So this is like a temporary memory
leak.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer chooses not to kill a thread if:
- an eligible thread has already been oom killed and has yet to exit,
and
- an eligible thread is exiting but has yet to free all its memory and
is not the thread attempting to currently allocate memory.
SysRq+F manually invokes the global oom killer to kill a memory-hogging
task. This is normally done as a last resort to free memory when no
progress is being made or to test the oom killer itself.
For both uses, we always want to kill a thread and never defer. This
patch causes SysRq+F to always kill an eligible thread and can be used to
force a kill even if another oom killed thread has failed to exit.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone. Even if
we only need compaction for an order 2 allocation, eg. for jumbo frames
networking.
The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
page_zone() requires an online node otherwise we are accessing NULL
NODE_DATA. This is not an issue at the moment because node_zones are
located at the structure beginning but this might change in the future
so better be careful about that.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following NULL ptr dereference caused by
cat /sys/devices/system/memory/memory0/removable
Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
is_pageblock_removable_nolock+0x34/0x40
is_mem_section_removable+0x74/0xf0
show_mem_removable+0x41/0x70
sysfs_read_file+0xfe/0x1c0
vfs_read+0xc7/0x130
sys_read+0x53/0xa0
system_call_fastpath+0x16/0x1b
We are crashing because we are trying to dereference NULL zone which
came from pfn=0 (struct page ffffea0000000000). According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
and early_node_map confirms that:
early_node_map[3] active PFN ranges
1: 0x00000010 -> 0x0000009c
1: 0x00000100 -> 0x000bffa3
1: 0x00100000 -> 0x00240000
The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks so the reserved range sneaks into the the section as well. This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.
When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.
We have only one node in this particular case and it is marked as node=1
(rather than 0) and that made the problem visible because page_to_nid will
return 0 and there are no zones on the node.
Let's check that the zone is valid and that the given pfn falls into its
boundaries and mark the section not removable. This might cause some
false positives, probably, but we do not have any sane way to find out
whether the page is reserved by the platform or it is just not used for
whatever other reasons.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed. For small high-orders, this has a
reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.
Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even in
cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If there is a zone below ZONE_NORMAL has present_pages, we can set node
state to N_NORMAL_MEMORY, no need to loop to end.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Having a unified structure with a LRU list set for both global zones and
per-memcg zones allows to keep that code simple which deals with LRU
lists and does not care about the container itself.
Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__free_pages_bootmem() used to special-case higher-order frees to save
individual page checking with free_pages_bulk().
Nowadays, both zero order and non-zero order frees use free_pages(), which
checks each individual page anyway, and so there is little point in making
the distinction anymore. The higher-order loop will work just fine for
zero order pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 88f5acf88a ("mm: page allocator: adjust the per-cpu counter
threshold when memory is low") changed the form how free_pages is
calculated but it forgot that we used to do free_pages - ((1 << order) -
1) so we ended up with off-by-two when calculating free_pages.
Reported-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.
This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.
But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones. And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list. This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.
Enter per-zone dirty limits. They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place. As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.
As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon. The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.
At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the amount of pages that
"spill over" are limited themselves by the lower zones' dirty constraints,
and thus unlikely to become a problem.
For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation. Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.
Test results
15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
seconds nr_vmscan_write
(stddev) min| median| max
xfs
vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
patched: 550.996( 3.802) 0.000| 0.000| 0.000
fuse-ntfs
vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
patched: 558.049(17.914) 0.000| 0.000| 43.000
btrfs
vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
patched: 563.365(11.368) 0.000| 0.000| 1362.000
ext4
vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
patched: 568.806(17.496) 0.000| 0.000| 0.000
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.
This patch:
The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:
+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+
This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
on access (read,write) to an unallocated page, which permits us to catch
code which corrupts memory. However the kernel is trying to maximise
memory usage, hence there are usually few free pages in the system and
buggy code usually corrupts some crucial data.
This patch changes the buddy allocator to keep more free/protected pages
and to interlace free/protected and allocated pages to increase the
probability of catching corruption.
When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
debug_guardpage_minorder defines the minimum order used by the page
allocator to grant a request. The requested size will be returned with
the remaining pages used as guard pages.
The default value of debug_guardpage_minorder is zero: no change from
current behaviour.
[akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Cross reported;
Under the following conditions, __alloc_pages_slowpath can loop forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.
In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set. The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.
The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.
This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used. Failing allocations like this may
cause suspend to abort but that is better than a livelock.
[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE. Ordinarily, this work is unnoticable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.
The bulk of the work is due to calling calling pageblock_is_reserved() an
unnecessary amount of times and accessing far more struct page metadata
than is necessary. This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
mm_page_free_batched
Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
all freed pages, not only for directly freed. So, let's name it properly.
For pages freed via page-list we also trigger mm_page_free_batched event.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It not exported and now nobody uses it.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
Kconfig: acpi: Fix typo in comment.
misc latin1 to utf8 conversions
devres: Fix a typo in devm_kfree comment
btrfs: free-space-cache.c: remove extra semicolon.
fat: Spelling s/obsolate/obsolete/g
SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
tools/power turbostat: update fields in manpage
mac80211: drop spelling fix
types.h: fix comment spelling for 'architectures'
typo fixes: aera -> area, exntension -> extension
devices.txt: Fix typo of 'VMware'.
sis900: Fix enum typo 'sis900_rx_bufer_status'
decompress_bunzip2: remove invalid vi modeline
treewide: Fix comment and string typo 'bufer'
hyper-v: Update MAINTAINERS
treewide: Fix typos in various parts of the kernel, and fix some comments.
clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
leds: Kconfig: Fix typo 'D2NET_V2'
sound: Kconfig: drop unknown symbol ARCH_CLPS7500
...
Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
*pdpt = 0000000000000000 *pde = f000ff53f000ff53
Oops: 0000 [#1] SMP
Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
EIP is at setup_zone_migrate_reserve+0xcd/0x180
EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
Call Trace:
[<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
[<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
[<c08a771c>] init_per_zone_wmark_min+0x27/0x86
[<c020111b>] do_one_initcall+0x2b/0x160
[<c086639d>] kernel_init+0xbe/0x157
[<c05cae26>] kernel_thread_helper+0x6/0xd
Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
CR2: 00000000000012b4
We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.
The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").
Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Dang Bo <bdang@vmware.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arve Hjnnevg <arve@android.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 70b50f94f1 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.
kernel BUG at include/linux/mm.h:386!
invalid opcode: 0000 [#1] SMP
Call Trace:
gup_pud_range+0xb8/0x19d
get_user_pages_fast+0xcb/0x192
? trace_hardirqs_off+0xd/0xf
hva_to_pfn+0x119/0x2f2
gfn_to_pfn_memslot+0x2c/0x2e
kvm_iommu_map_pages+0xfd/0x1c1
kvm_iommu_map_memslots+0x7c/0xbd
kvm_iommu_map_guest+0xaa/0xbf
kvm_vm_ioctl_assigned_device+0x2ef/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP gup_huge_pud+0xf2/0x159
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEBLOCK_NODE_MAP -
there's no user of early_node_map[] left. Kill early_node_map[] and
replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
as page_alloc.c would no longer host an alternative implementation.
This change is ultimately one to one mapping and shouldn't cause any
observable difference; however, after the recent changes, there are
some functions which now would fit memblock.c better than page_alloc.c
and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
doesn't make much sense on some of them. Further cleanups for
functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.
-v2: Fix compile bug introduced by mis-spelling
CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
mmzone.h. Reported by Stephen Rothwell.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Conflicts & resolutions:
* arch/x86/xen/setup.c
dc91c728fd "xen: allow extra memory to be in multiple regions"
24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."
conflicted on xen_add_extra_mem() updates. The resolution is
trivial as the latter just want to replace
memblock_x86_reserve_range() with memblock_reserve().
* drivers/pci/intel-iommu.c
166e9278a3 "x86/ia64: intel-iommu: move to drivers/iommu/"
5dfe8660a3 "bootmem: Replace work_with_active_regions() with..."
conflicted as the former moved the file under drivers/iommu/.
Resolved by applying the chnages from the latter on the moved
file.
* mm/Kconfig
6661672053 "memblock: add NO_BOOTMEM config symbol"
c378ddd53f "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"
conflicted trivially. Both added config options. Just
letting both add their own options resolves the conflict.
* mm/memblock.c
d1f0ece6cd "mm/memblock.c: small function definition fixes"
ed7b56a799 "memblock: Remove memblock_memory_can_coalesce()"
confliected. The former updates function removed by the
latter. Resolution is trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Only tail pages point at the head page using their ->first_page fields.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Add __attribute__((format (printf...) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.
[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we get a bad_page bug report, it's useful to see what modules the
user had loaded.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
init_fault_attr_dentries() is used to export fault_attr via debugfs.
But it can only export it in debugfs root directory.
Per Forlin is working on mmc_fail_request which adds support to inject
data errors after a completed host transfer in MMC subsystem.
The fault_attr for mmc_fail_request should be defined per mmc host and
export it in debugfs directory per mmc host like
/sys/kernel/debug/mmc0/mmc_fail_request.
init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
introduces fault_create_debugfs_attr() which is able to create a
directory in the arbitrary directory and replace
init_fault_attr_dentries().
[akpm@linux-foundation.org: extraneous semicolon, per Randy]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Tested-by: Per Forlin <per.forlin@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now cleanup_fault_attr_dentries() recursively removes a directory, So we
can simplify the error handling in the initialization code and no need
to hold dentry structs for each debugfs file.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use debugfs_remove_recursive() to simplify initialization and
deinitialization of fault injection debugfs files.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With zone_reclaim_mode enabled, it's possible for zones to be considered
full in the zonelist_cache so they are skipped in the future. If the
process enters direct reclaim, the ZLC may still consider zones to be full
even after reclaiming pages. Reconsider all zones for allocation if
direct reclaim returns successfully.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There have been a small number of complaints about significant stalls
while copying large amounts of data on NUMA machines reported on a
distribution bugzilla. In these cases, zone_reclaim was enabled by
default due to large NUMA distances. In general, the complaints have not
been about the workload itself unless it was a file server (in which case
the recommendation was disable zone_reclaim).
The stalls are mostly due to significant amounts of time spent scanning
the preferred zone for pages to free. After a failure, it might fallback
to another node (as zonelists are often node-ordered rather than
zone-ordered) but stall quickly again when the next allocation attempt
occurs. In bad cases, each page allocated results in a full scan of the
preferred zone.
Patch 1 checks the preferred zone for recent allocation failure
which is particularly important if zone_reclaim has failed
recently. This avoids rescanning the zone in the near future and
instead falling back to another node. This may hurt node locality
in some cases but a failure to zone_reclaim is more expensive than
a remote access.
Patch 2 clears the zlc information after direct reclaim.
Otherwise, zone_reclaim can mark zones full, direct reclaim can
reclaim enough pages but the zone is still not considered for
allocation.
This was tested on a 24-thread 2-node x86_64 machine. The tests were
focused on large amounts of IO. All tests were bound to the CPUs on
node-0 to avoid disturbances due to processes being scheduled on different
nodes. The kernels tested are
3.0-rc6-vanilla Vanilla 3.0-rc6
zlcfirst Patch 1 applied
zlcreconsider Patches 1+2 applied
FS-Mark
./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
fsmark-3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirs zlcreconsider
Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 501.49 493.91 499.93
Total Elapsed Time (seconds) 2451.57 2257.48 2215.92
MMTests Statistics: vmstat
Page Ins 46268 63840 66008
Page Outs 90821596 90671128 88043732
Swap Ins 0 0 0
Swap Outs 0 0 0
Direct pages scanned 13091697 8966863 8971790
Kswapd pages scanned 0 1830011 1831116
Kswapd pages reclaimed 0 1829068 1829930
Direct pages reclaimed 13037777 8956828 8648314
Kswapd efficiency 100% 99% 99%
Kswapd velocity 0.000 810.643 826.346
Direct efficiency 99% 99% 96%
Direct velocity 5340.128 3972.068 4048.788
Percentage direct scans 100% 83% 83%
Page writes by reclaim 0 3 0
Slabs scanned 796672 720640 720256
Direct inode steals 7422667 7160012 7088638
Kswapd inode steals 0 1736840 2021238
Test completes far faster with a large increase in the number of files
created per second. Standard deviation is high as a small number of
iterations were much higher than the mean. The number of pages scanned by
zone_reclaim is reduced and kswapd is used for more work.
LARGE DD
3.0-rc6 3.0-rc6 3.0-rc6
vanilla zlcfirst zlcreconsider
download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 125.03 118.98 122.01
Total Elapsed Time (seconds) 624.56 375.02 398.06
MMTests Statistics: vmstat
Page Ins 3594216 439368 407032
Page Outs 23380832 23380488 23377444
Swap Ins 0 0 0
Swap Outs 0 436 287
Direct pages scanned 17482342 69315973 82864918
Kswapd pages scanned 0 519123 575425
Kswapd pages reclaimed 0 466501 522487
Direct pages reclaimed 5858054 2732949 2712547
Kswapd efficiency 100% 89% 90%
Kswapd velocity 0.000 1384.254 1445.574
Direct efficiency 33% 3% 3%
Direct velocity 27991.453 184832.737 208171.929
Percentage direct scans 100% 99% 99%
Page writes by reclaim 0 5082 13917
Slabs scanned 17280 29952 35328
Direct inode steals 115257 1431122 332201
Kswapd inode steals 0 0 979532
This test downloads a large tarfile and copies it with dd a number of
times - similar to the most recent bug report I've dealt with. Time to
completion is reduced. The number of pages scanned directly is still
disturbingly high with a low efficiency but this is likely due to the
number of dirty pages encountered. The figures could probably be improved
with more work around how kswapd is used and how dirty pages are handled
but that is separate work and this result is significant on its own.
Streaming Mapped Writer
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 124.47 111.67 112.64
Total Elapsed Time (seconds) 2138.14 1816.30 1867.56
MMTests Statistics: vmstat
Page Ins 90760 89124 89516
Page Outs 121028340 120199524 120736696
Swap Ins 0 86 55
Swap Outs 0 0 0
Direct pages scanned 114989363 96461439 96330619
Kswapd pages scanned 56430948 56965763 57075875
Kswapd pages reclaimed 27743219 27752044 27766606
Direct pages reclaimed 49777 46884 36655
Kswapd efficiency 49% 48% 48%
Kswapd velocity 26392.541 31363.631 30561.736
Direct efficiency 0% 0% 0%
Direct velocity 53780.091 53108.759 51581.004
Percentage direct scans 67% 62% 62%
Page writes by reclaim 385 122 1513
Slabs scanned 43008 39040 42112
Direct inode steals 0 10 8
Kswapd inode steals 733 534 477
This test just creates a large file mapping and writes to it linearly.
Time to completion is again reduced.
The gains are mostly down to two things. In many cases, there is less
scanning as zone_reclaim simply gives up faster due to recent failures.
The second reason is that memory is used more efficiently. Instead of
scanning the preferred zone every time, the allocator falls back to
another zone and uses it instead improving overall memory utilisation.
This patch: initialise ZLC for first zone eligible for zone_reclaim.
The zonelist cache (ZLC) is used among other things to record if
zone_reclaim() failed for a particular zone recently. The intention is to
avoid a high cost scanning extremely long zonelists or scanning within the
zone uselessly.
Currently the zonelist cache is setup only after the first zone has been
considered and zone_reclaim() has been called. The objective was to avoid
a costly setup but zone_reclaim is itself quite expensive. If it is
failing regularly such as the first eligible zone having mostly mapped
pages, the cost in scanning and allocation stalls is far higher than the
ZLC initialisation step.
This patch initialises ZLC before the first eligible zone calls
zone_reclaim(). Once initialised, it is checked whether the zone failed
zone_reclaim recently. If it has, the zone is skipped. As the first zone
is now being checked, additional care has to be taken about zones marked
full. A zone can be marked "full" because it should not have enough
unmapped pages for zone_reclaim but this is excessive as direct reclaim or
kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
zone_reclaim fails if it failed to reclaim enough pages after scanning.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From 83103b92f3234ec830852bbc5c45911bd6cbdb20 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 14 Jul 2011 11:22:16 +0200
Add optional region->nid which can be enabled by arch using
CONFIG_HAVE_MEMBLOCK_NODE_MAP. When enabled, memblock also carries
NUMA node information and replaces early_node_map[].
Newly added memblocks have MAX_NUMNODES as nid. Arch can then call
memblock_set_node() to set node information. memblock takes care of
merging and node affine allocations w.r.t. node information.
When MEMBLOCK_NODE_MAP is enabled, early_node_map[], related data
structures and functions to manipulate and iterate it are disabled.
memblock version of __next_mem_pfn_range() is provided such that
for_each_mem_pfn_range() behaves the same and its users don't have to
be updated.
-v2: Yinghai spotted section mismatch caused by missing
__init_memblock in memblock_set_node(). Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714094342.GF3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
With the previous changes, generic NUMA aware memblock API has feature
parity with memblock_x86_find_in_range_node(). There currently are
two users - x86 setup_node_data() and __alloc_memory_core_early() in
nobootmem.c.
This patch converts the former to use memblock_alloc_nid() and the
latter memblock_find_range_in_node(), and kills
memblock_x86_find_in_range_node() and related functions including
find_memory_early_core_early() in page_alloc.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310460395-30913-9-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The previous patch added for_each_mem_pfn_range() which is more
versatile than for_each_active_range_index_in_nid(). This patch
replaces for_each_active_range_index_in_nid() and open coded
early_node_map[] walks with for_each_mem_pfn_range().
All conversions in this patch are straight-forward and shouldn't cause
any functional difference. After the conversions,
for_each_active_range_index_in_nid() doesn't have any user left and is
removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310460395-30913-4-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
__absent_pages_in_range() was needlessly complex. Reimplement it
using for_each_mem_pfn_range().
Also, update zone_absent_pages_in_node() such that it doesn't call
__absent_pages_in_range() with @zone_start_pfn which is larger than
@zone_end_pfn.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310460395-30913-3-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Callback based iteration is cumbersome and much less useful than
for_each_*() iterator. This patch implements for_each_mem_pfn_range()
which replaces work_with_active_regions(). All the current users of
work_with_active_regions() are converted.
This simplifies walking over early_node_map and will allow converting
internal logics in page_alloc to use iterator instead of walking
early_node_map directly, which in turn will enable moving node
information to memblock.
powerpc change is only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made
MEMBLOCK_ERROR 0 and there already are codes which expect error return
to be 0. There's no point in keeping MEMBLOCK_ERROR around. End its
misery.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
a226f6c899 (FRV: Clean up bootmem allocator's page freeing algorithm)
separated out __free_pages_bootmem() from free_all_bootmem_core().
__free_pages_bootmem() takes @order argument but it assumes @order is
either 0 or ilog2(BITS_PER_LONG). Note that all the current users
match that assumption and this doesn't cause actual problems.
Fix it by using 1 << order instead of BITS_PER_LONG.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310457490-3356-3-git-send-email-tj@kernel.org
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
sections array to map pfn to nid which is limited in granularity. If
NUMA nodes are laid out such that the mapping cannot be accurate, boot
will fail triggering BUG_ON() in mminit_verify_page_links().
On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been
granular enough until commit 2706a0bf7b (x86, NUMA: Enable
CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which
aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This
led to the following BUG_ON().
On node 0 totalpages: 2096615
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 3927 pages, LIFO batch:0
Normal zone: 1740 pages used for memmap
Normal zone: 220978 pages, LIFO batch:31
HighMem zone: 16405 pages used for memmap
HighMem zone: 1853533 pages, LIFO batch:31
BUG: Int 6: CR2 (null)
EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
err (null) EIP c16209aa CS 00000060 flg 00010002
Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
(null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
Call Trace:
[<c136b1e5>] ? early_fault+0x2e/0x2e
[<c16209aa>] ? mminit_verify_page_links+0x12/0x42
[<c1620613>] ? memmap_init_zone+0xaf/0x10c
[<c1620929>] ? free_area_init_node+0x2b9/0x2e3
[<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
[<c1601d80>] ? paging_init+0x112/0x118
[<c15f578d>] ? setup_arch+0x791/0x82f
[<c15f43d9>] ? start_kernel+0x6a/0x257
This patch implements node_map_pfn_alignment() which determines
maximum internode alignment and update numa_register_memblks() to
reject NUMA configuration if alignment exceeds the pfn -> nid mapping
granularity of the memory model as determined by PAGES_PER_SECTION.
This makes the problematic machine boot w/ flatmem by rejecting the
NUMA config and provides protection against crazy NUMA configurations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Conny Seidel <conny.seidel@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This reverts commit a197b59ae6.
As rmk says:
"Commit a197b59ae6 (mm: fail GFP_DMA allocations when ZONE_DMA is not
configured) is causing regressions on ARM with various drivers which
use GFP_DMA.
The behaviour up until now has been to silently ignore that flag when
CONFIG_ZONE_DMA is not enabled, and to allocate from the normal zone.
However, as a result of the above commit, such allocations now fail
which causes drivers to fail. These are regressions compared to the
previous kernel version."
so just revert it.
Requested-by: Russell King <linux@arm.linux.org.uk>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During memory reclaim we determine the number of pages to be scanned per
zone as
(anon + file) >> priority.
Assume
scan = (anon + file) >> priority.
If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
priority gets higher. This has some problems.
1. This increases priority as 1 without any scan.
To do scan in this priority, amount of pages should be larger than 512M.
If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
batched, later. (But we lose 1 priority.)
If memory size is below 16M, pages >> priority is 0 and no scan in
DEF_PRIORITY forever.
2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
So, x86's ZONE_DMA will never be recoverred until the user of pages
frees memory by itself.
3. With memcg, the limit of memory can be small. When using small memcg,
it gets priority < DEF_PRIORITY-2 very easily and need to call
wait_iff_congested().
For doing scan before priorty=9, 64MB of memory should be used.
Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when
1. the target is enough small.
2. it's kswapd or memcg reclaim.
Then we can avoid rapid priority drop and may be able to recover
all_unreclaimable in a small zones. And this patch removes nr_saved_scan.
This will allow scanning in this priority even when pages >> priority is
very small.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ying Han <yinghan@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I believe I found a problem in __alloc_pages_slowpath, which allows a
process to get stuck endlessly looping, even when lots of memory is
available.
Running an I/O and memory intensive stress-test I see a 0-order page
allocation with __GFP_IO and __GFP_WAIT, running on a system with very
little free memory. Right about the same time that the stress-test gets
killed by the OOM-killer, the utility trying to allocate memory gets stuck
in __alloc_pages_slowpath even though most of the systems memory was freed
by the oom-kill of the stress-test.
The utility ends up looping from the rebalance label down through the
wait_iff_congested continiously. Because order=0,
__alloc_pages_direct_compact skips the call to get_page_from_freelist.
Because all of the reclaimable memory on the system has already been
reclaimed, __alloc_pages_direct_reclaim skips the call to
get_page_from_freelist. Since there is no __GFP_FS flag, the block with
__alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested,
then jumps back to rebalance without ever trying to
get_page_from_freelist. This loop repeats infinitely.
The test case is pretty pathological. Running a mix of I/O stress-tests
that do a lot of fork() and consume all of the system memory, I can pretty
reliably hit this on 600 nodes, in about 12 hours. 32GB/node.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator will improperly return a page from ZONE_NORMAL even
when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller
expects DMA memory, perhaps for ISA devices with 16-bit address registers,
and may get higher memory resulting in undefined behavior.
This patch causes the page allocator to return NULL in such circumstances
with a warning emitted to the kernel log on the first occurrence.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>