2005-10-30 09:16:54 +08:00
|
|
|
/*
|
|
|
|
* linux/mm/memory_hotplug.c
|
|
|
|
*
|
|
|
|
* Copyright (C)
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/stddef.h>
|
|
|
|
#include <linux/mm.h>
|
2017-02-03 02:15:33 +08:00
|
|
|
#include <linux/sched/signal.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/interrupt.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/compiler.h>
|
2011-10-16 14:01:52 +08:00
|
|
|
#include <linux/export.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/pagevec.h>
|
2006-09-29 17:01:25 +08:00
|
|
|
#include <linux/writeback.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/sysctl.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/memory.h>
|
2016-01-16 08:56:22 +08:00
|
|
|
#include <linux/memremap.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/memory_hotplug.h>
|
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/vmalloc.h>
|
2006-06-27 17:53:35 +08:00
|
|
|
#include <linux/ioport.h>
|
2007-10-16 16:26:12 +08:00
|
|
|
#include <linux/delay.h>
|
|
|
|
#include <linux/migrate.h>
|
|
|
|
#include <linux/page-isolation.h>
|
2008-10-19 11:25:58 +08:00
|
|
|
#include <linux/pfn.h>
|
2009-11-18 06:06:22 +08:00
|
|
|
#include <linux/suspend.h>
|
2009-12-15 09:58:11 +08:00
|
|
|
#include <linux/mm_inline.h>
|
2010-03-06 05:41:58 +08:00
|
|
|
#include <linux/firmware-map.h>
|
2013-02-23 08:33:14 +08:00
|
|
|
#include <linux/stop_machine.h>
|
2013-09-12 05:22:09 +08:00
|
|
|
#include <linux/hugetlb.h>
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
#include <linux/memblock.h>
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
#include <linux/compaction.h>
|
hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory
prevents memory offline to succeed on 4.4 base kernel. The underlying
reason was that the HWPoison page has an elevated reference count and the
migration keeps failing. There are two problems with that. First of all
it is dubious to migrate the poisoned page because we know that accessing
that memory is possible to fail. Secondly it doesn't make any sense to
migrate a potentially broken content and preserve the memory corruption
over to a new location.
Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simply testcase
===
int main(void)
{
int ret;
int i;
int fd;
char *array = malloc(4096);
char *array_locked = malloc(4096);
fd = open("/tmp/data", O_RDONLY);
read(fd, array, 4095);
for (i = 0; i < 4096; i++)
array_locked[i] = 'd';
ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
if (ret)
perror("mlock");
sleep (20);
ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
if (ret)
perror("madvise");
for (i = 0; i < 4096; i++)
array_locked[i] = 'd';
return 0;
}
===
+ offline this memory.
In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
list
kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel: [<ffffffff81215f08>] do_execve+0x28/0x30
kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80
And that latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.
With the upstream kernel the failure is slightly different. The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails and
do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.
Fix both cases by explicitly checking HWPoisoned pages before we even try
to get reference on the page, try to unmap it if it is still mapped. As
explained by Naoya:
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.
Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself about
that. Naoya has noted the following:
: Theoretically no such gurantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also impli=
: es
: that the target page can still have PageLRU flag.
:
: /*
: * Torn down by someone else?
: */
: if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) {
: action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
: res =3D -EBUSY;
: goto out;
: }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.
Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.com>
Debugged-by: Oscar Salvador <osalvador@suse.com>
Tested-by: Oscar Salvador <osalvador@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:38:01 +08:00
|
|
|
#include <linux/rmap.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
|
|
|
|
#include <asm/tlbflush.h>
|
|
|
|
|
2008-04-29 01:40:08 +08:00
|
|
|
#include "internal.h"
|
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
/*
|
|
|
|
* online_page_callback contains pointer to current page onlining function.
|
|
|
|
* Initially it is generic_online_page(). If it is required it could be
|
|
|
|
* changed by calling set_online_page_callback() for callback registration
|
|
|
|
* and restore_online_page_callback() for generic callback restore.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void generic_online_page(struct page *page);
|
|
|
|
|
|
|
|
static online_page_callback_t online_page_callback = generic_online_page;
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
static DEFINE_MUTEX(online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
2017-07-11 06:50:09 +08:00
|
|
|
DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
|
2017-07-11 06:50:09 +08:00
|
|
|
void get_online_mems(void)
|
|
|
|
{
|
|
|
|
percpu_down_read(&mem_hotplug_lock);
|
|
|
|
}
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
|
2017-07-11 06:50:09 +08:00
|
|
|
void put_online_mems(void)
|
|
|
|
{
|
|
|
|
percpu_up_read(&mem_hotplug_lock);
|
|
|
|
}
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
|
2017-07-07 06:41:05 +08:00
|
|
|
bool movable_node_enabled = false;
|
|
|
|
|
2016-05-20 08:13:03 +08:00
|
|
|
#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
|
2016-03-16 05:56:48 +08:00
|
|
|
bool memhp_auto_online;
|
2016-05-20 08:13:03 +08:00
|
|
|
#else
|
|
|
|
bool memhp_auto_online = true;
|
|
|
|
#endif
|
2016-03-16 05:56:48 +08:00
|
|
|
EXPORT_SYMBOL_GPL(memhp_auto_online);
|
|
|
|
|
2016-05-20 08:13:06 +08:00
|
|
|
static int __init setup_memhp_default_state(char *str)
|
|
|
|
{
|
|
|
|
if (!strcmp(str, "online"))
|
|
|
|
memhp_auto_online = true;
|
|
|
|
else if (!strcmp(str, "offline"))
|
|
|
|
memhp_auto_online = false;
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
__setup("memhp_default_state=", setup_memhp_default_state);
|
|
|
|
|
2015-04-15 06:45:11 +08:00
|
|
|
void mem_hotplug_begin(void)
|
2010-12-03 06:31:19 +08:00
|
|
|
{
|
2017-07-11 06:50:09 +08:00
|
|
|
cpus_read_lock();
|
|
|
|
percpu_down_write(&mem_hotplug_lock);
|
2010-12-03 06:31:19 +08:00
|
|
|
}
|
|
|
|
|
2015-04-15 06:45:11 +08:00
|
|
|
void mem_hotplug_done(void)
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
{
|
2017-07-11 06:50:09 +08:00
|
|
|
percpu_up_write(&mem_hotplug_lock);
|
|
|
|
cpus_read_unlock();
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
}
|
2010-12-03 06:31:19 +08:00
|
|
|
|
2006-10-01 14:27:09 +08:00
|
|
|
/* add this memory to iomem resource */
|
|
|
|
static struct resource *register_memory_resource(u64 start, u64 size)
|
|
|
|
{
|
2017-09-09 07:11:43 +08:00
|
|
|
struct resource *res, *conflict;
|
2006-10-01 14:27:09 +08:00
|
|
|
res = kzalloc(sizeof(struct resource), GFP_KERNEL);
|
2016-01-15 07:21:55 +08:00
|
|
|
if (!res)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2006-10-01 14:27:09 +08:00
|
|
|
|
|
|
|
res->name = "System RAM";
|
|
|
|
res->start = start;
|
|
|
|
res->end = start + size - 1;
|
2016-01-27 04:57:24 +08:00
|
|
|
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
|
2017-09-09 07:11:43 +08:00
|
|
|
conflict = request_resource_conflict(&iomem_resource, res);
|
|
|
|
if (conflict) {
|
|
|
|
if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
|
|
|
|
pr_debug("Device unaddressable memory block "
|
|
|
|
"memory hotplug at %#010llx !\n",
|
|
|
|
(unsigned long long)start);
|
|
|
|
}
|
2013-07-04 06:02:39 +08:00
|
|
|
pr_debug("System RAM resource %pR cannot be added\n", res);
|
2006-10-01 14:27:09 +08:00
|
|
|
kfree(res);
|
2016-01-15 07:21:55 +08:00
|
|
|
return ERR_PTR(-EEXIST);
|
2006-10-01 14:27:09 +08:00
|
|
|
}
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void release_memory_resource(struct resource *res)
|
|
|
|
{
|
|
|
|
if (!res)
|
|
|
|
return;
|
|
|
|
release_resource(res);
|
|
|
|
kfree(res);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-10-01 14:27:08 +08:00
|
|
|
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
2013-02-23 08:33:00 +08:00
|
|
|
void get_page_bootmem(unsigned long info, struct page *page,
|
|
|
|
unsigned long type)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
2017-02-23 07:45:13 +08:00
|
|
|
page->freelist = (void *)type;
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
SetPagePrivate(page);
|
|
|
|
set_page_private(page, info);
|
2016-03-18 05:19:26 +08:00
|
|
|
page_ref_inc(page);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
}
|
|
|
|
|
2013-07-04 06:03:17 +08:00
|
|
|
void put_page_bootmem(struct page *page)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
2011-01-14 07:47:00 +08:00
|
|
|
unsigned long type;
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2017-02-23 07:45:13 +08:00
|
|
|
type = (unsigned long) page->freelist;
|
2011-01-14 07:47:00 +08:00
|
|
|
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
|
|
|
|
type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2016-03-18 05:19:26 +08:00
|
|
|
if (page_ref_dec_return(page) == 1) {
|
2017-02-23 07:45:13 +08:00
|
|
|
page->freelist = NULL;
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(page, 0);
|
2011-01-14 07:47:00 +08:00
|
|
|
INIT_LIST_HEAD(&page->lru);
|
2013-07-04 06:03:17 +08:00
|
|
|
free_reserved_page(page);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-02-23 08:33:00 +08:00
|
|
|
#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
|
|
|
|
#ifndef CONFIG_SPARSEMEM_VMEMMAP
|
2008-07-24 12:28:12 +08:00
|
|
|
static void register_page_bootmem_info_section(unsigned long start_pfn)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
|
|
|
unsigned long *usemap, mapsize, section_nr, i;
|
|
|
|
struct mem_section *ms;
|
|
|
|
struct page *page, *memmap;
|
|
|
|
|
|
|
|
section_nr = pfn_to_section_nr(start_pfn);
|
|
|
|
ms = __nr_to_section(section_nr);
|
|
|
|
|
|
|
|
/* Get section's memmap address */
|
|
|
|
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get page for the memmap's phys address
|
|
|
|
* XXX: need more consideration for sparse_vmemmap...
|
|
|
|
*/
|
|
|
|
page = virt_to_page(memmap);
|
|
|
|
mapsize = sizeof(struct page) * PAGES_PER_SECTION;
|
|
|
|
mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
/* remember memmap's page */
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
|
|
|
get_page_bootmem(section_nr, page, SECTION_INFO);
|
|
|
|
|
2018-02-01 08:17:25 +08:00
|
|
|
usemap = ms->pageblock_flags;
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
page = virt_to_page(usemap);
|
|
|
|
|
|
|
|
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
2008-07-24 12:28:17 +08:00
|
|
|
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
|
|
|
}
|
2013-02-23 08:33:00 +08:00
|
|
|
#else /* CONFIG_SPARSEMEM_VMEMMAP */
|
|
|
|
static void register_page_bootmem_info_section(unsigned long start_pfn)
|
|
|
|
{
|
|
|
|
unsigned long *usemap, mapsize, section_nr, i;
|
|
|
|
struct mem_section *ms;
|
|
|
|
struct page *page, *memmap;
|
|
|
|
|
|
|
|
section_nr = pfn_to_section_nr(start_pfn);
|
|
|
|
ms = __nr_to_section(section_nr);
|
|
|
|
|
|
|
|
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
|
|
|
|
|
|
|
|
register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
|
|
|
|
|
2018-02-01 08:17:25 +08:00
|
|
|
usemap = ms->pageblock_flags;
|
2013-02-23 08:33:00 +08:00
|
|
|
page = virt_to_page(usemap);
|
|
|
|
|
|
|
|
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
|
|
|
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
|
|
|
|
}
|
|
|
|
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2016-05-28 06:23:32 +08:00
|
|
|
void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
|
|
|
unsigned long i, pfn, end_pfn, nr_pages;
|
|
|
|
int node = pgdat->node_id;
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
|
|
|
|
page = virt_to_page(pgdat);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++, page++)
|
|
|
|
get_page_bootmem(node, page, NODE_INFO);
|
|
|
|
|
|
|
|
pfn = pgdat->node_start_pfn;
|
2013-02-23 08:35:32 +08:00
|
|
|
end_pfn = pgdat_end_pfn(pgdat);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2013-07-09 07:00:23 +08:00
|
|
|
/* register section info */
|
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000
free_all_bootmem_node()
register_page_bootmem_info_node()
register_page_bootmem_info_section()
When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().
sparse_remove_one_section()
free_section_usemap()
free_map_bootmem()
put_page_bootmem()
[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 05:09:24 +08:00
|
|
|
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
/*
|
|
|
|
* Some platforms can assign the same pfn to multiple nodes - on
|
|
|
|
* node0 as well as nodeN. To avoid registering a pfn against
|
|
|
|
* multiple nodes we check that this pfn does not already
|
2013-07-09 07:00:23 +08:00
|
|
|
* reside in some other nodes.
|
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000
free_all_bootmem_node()
register_page_bootmem_info_node()
register_page_bootmem_info_section()
When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().
sparse_remove_one_section()
free_section_usemap()
free_map_bootmem()
put_page_bootmem()
[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 05:09:24 +08:00
|
|
|
*/
|
2016-05-28 05:27:32 +08:00
|
|
|
if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
|
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000
free_all_bootmem_node()
register_page_bootmem_info_node()
register_page_bootmem_info_section()
When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().
sparse_remove_one_section()
free_section_usemap()
free_map_bootmem()
put_page_bootmem()
[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 05:09:24 +08:00
|
|
|
register_page_bootmem_info_section(pfn);
|
|
|
|
}
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
}
|
2013-02-23 08:33:00 +08:00
|
|
|
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
|
2017-12-29 15:53:54 +08:00
|
|
|
struct vmem_altmap *altmap, bool want_memblock)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2006-08-06 03:15:06 +08:00
|
|
|
if (pfn_valid(phys_start_pfn))
|
|
|
|
return -EEXIST;
|
|
|
|
|
2018-12-28 16:37:06 +08:00
|
|
|
ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
|
2005-10-30 09:16:54 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2017-07-07 06:37:45 +08:00
|
|
|
if (!want_memblock)
|
|
|
|
return 0;
|
|
|
|
|
2018-04-06 07:22:56 +08:00
|
|
|
return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
|
2005-10-30 09:16:54 +08:00
|
|
|
}
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
/*
|
|
|
|
* Reasonably generic function for adding memory. It is
|
|
|
|
* expected that archs that support memory hotplug will
|
|
|
|
* call this function after deciding the zone to which to
|
|
|
|
* add the new pages.
|
|
|
|
*/
|
2017-07-07 06:38:11 +08:00
|
|
|
int __ref __add_pages(int nid, unsigned long phys_start_pfn,
|
2017-12-29 15:53:53 +08:00
|
|
|
unsigned long nr_pages, struct vmem_altmap *altmap,
|
|
|
|
bool want_memblock)
|
2013-04-30 06:08:22 +08:00
|
|
|
{
|
|
|
|
unsigned long i;
|
|
|
|
int err = 0;
|
|
|
|
int start_sec, end_sec;
|
2016-01-16 08:56:22 +08:00
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
/* during initialize mem_map, align hot-added range to section */
|
|
|
|
start_sec = pfn_to_section_nr(phys_start_pfn);
|
|
|
|
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
if (altmap) {
|
|
|
|
/*
|
|
|
|
* Validate altmap is within bounds of the total request
|
|
|
|
*/
|
|
|
|
if (altmap->base_pfn != phys_start_pfn
|
|
|
|
|| vmem_altmap_offset(altmap) > nr_pages) {
|
|
|
|
pr_warn_once("memory add fail, invalid altmap\n");
|
2016-03-16 05:57:51 +08:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
2016-01-16 08:56:22 +08:00
|
|
|
}
|
|
|
|
altmap->alloc = 0;
|
|
|
|
}
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
for (i = start_sec; i <= end_sec; i++) {
|
2017-12-29 15:53:54 +08:00
|
|
|
err = __add_section(nid, section_nr_to_pfn(i), altmap,
|
|
|
|
want_memblock);
|
2013-04-30 06:08:22 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* EEXIST is finally dealt with by ioresource collision
|
|
|
|
* check. see add_memory() => register_memory_resource()
|
|
|
|
* Warning will be printed if there is collision.
|
|
|
|
*/
|
|
|
|
if (err && (err != -EEXIST))
|
|
|
|
break;
|
|
|
|
err = 0;
|
2017-10-04 07:16:16 +08:00
|
|
|
cond_resched();
|
2013-04-30 06:08:22 +08:00
|
|
|
}
|
2015-06-25 07:58:42 +08:00
|
|
|
vmemmap_populate_print_last();
|
2016-03-16 05:57:51 +08:00
|
|
|
out:
|
2013-04-30 06:08:22 +08:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2013-02-23 08:33:12 +08:00
|
|
|
/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
|
2017-10-04 07:16:32 +08:00
|
|
|
static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
struct mem_section *ms;
|
|
|
|
|
|
|
|
for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(start_pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (unlikely(pfn_to_nid(start_pfn) != nid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zone && zone != page_zone(pfn_to_page(start_pfn)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return start_pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
|
2017-10-04 07:16:32 +08:00
|
|
|
static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
struct mem_section *ms;
|
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
/* pfn is the end pfn of a memory section. */
|
|
|
|
pfn = end_pfn - 1;
|
|
|
|
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (unlikely(pfn_to_nid(pfn) != nid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zone && zone != page_zone(pfn_to_page(pfn)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
2013-09-12 05:21:44 +08:00
|
|
|
unsigned long zone_start_pfn = zone->zone_start_pfn;
|
|
|
|
unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
|
|
|
|
unsigned long zone_end_pfn = z;
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
struct mem_section *ms;
|
|
|
|
int nid = zone_to_nid(zone);
|
|
|
|
|
|
|
|
zone_span_writelock(zone);
|
|
|
|
if (zone_start_pfn == start_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is smallest section in the zone, it need
|
|
|
|
* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
|
|
|
|
* In this case, we find second smallest valid mem_section
|
|
|
|
* for shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_smallest_section_pfn(nid, zone, end_pfn,
|
|
|
|
zone_end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
zone->zone_start_pfn = pfn;
|
|
|
|
zone->spanned_pages = zone_end_pfn - pfn;
|
|
|
|
}
|
|
|
|
} else if (zone_end_pfn == end_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is biggest section in the zone, it need
|
|
|
|
* shrink zone->spanned_pages.
|
|
|
|
* In this case, we find second biggest valid mem_section for
|
|
|
|
* shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
|
|
|
|
start_pfn);
|
|
|
|
if (pfn)
|
|
|
|
zone->spanned_pages = pfn - zone_start_pfn + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The section is not biggest or smallest mem_section in the zone, it
|
|
|
|
* only creates a hole in the zone. So in this case, we need not
|
|
|
|
* change the zone. But perhaps, the zone has only hole data. Thus
|
|
|
|
* it check the zone has only hole or not.
|
|
|
|
*/
|
|
|
|
pfn = zone_start_pfn;
|
|
|
|
for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (page_zone(pfn_to_page(pfn)) != zone)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If the section is current section, it continues the loop */
|
|
|
|
if (start_pfn == pfn)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we find valid section, we have nothing to do */
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The zone has no valid section */
|
|
|
|
zone->zone_start_pfn = 0;
|
|
|
|
zone->spanned_pages = 0;
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void shrink_pgdat_span(struct pglist_data *pgdat,
|
|
|
|
unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2013-11-13 07:07:19 +08:00
|
|
|
unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
|
|
|
|
unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
|
|
|
|
unsigned long pgdat_end_pfn = p;
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
struct mem_section *ms;
|
|
|
|
int nid = pgdat->node_id;
|
|
|
|
|
|
|
|
if (pgdat_start_pfn == start_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is smallest section in the pgdat, it need
|
|
|
|
* shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
|
|
|
|
* In this case, we find second smallest valid mem_section
|
|
|
|
* for shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
|
|
|
|
pgdat_end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
pgdat->node_start_pfn = pfn;
|
|
|
|
pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
|
|
|
|
}
|
|
|
|
} else if (pgdat_end_pfn == end_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is biggest section in the pgdat, it need
|
|
|
|
* shrink pgdat->node_spanned_pages.
|
|
|
|
* In this case, we find second biggest valid mem_section for
|
|
|
|
* shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
|
|
|
|
start_pfn);
|
|
|
|
if (pfn)
|
|
|
|
pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the section is not biggest or smallest mem_section in the pgdat,
|
|
|
|
* it only creates a hole in the pgdat. So in this case, we need not
|
|
|
|
* change the pgdat.
|
|
|
|
* But perhaps, the pgdat has only hole data. Thus it check the pgdat
|
|
|
|
* has only hole or not.
|
|
|
|
*/
|
|
|
|
pfn = pgdat_start_pfn;
|
|
|
|
for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (pfn_to_nid(pfn) != nid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If the section is current section, it continues the loop */
|
|
|
|
if (start_pfn == pfn)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we find valid section, we have nothing to do */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The pgdat has no valid section */
|
|
|
|
pgdat->node_start_pfn = 0;
|
|
|
|
pgdat->node_spanned_pages = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __remove_zone(struct zone *zone, unsigned long start_pfn)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
int nr_pages = PAGES_PER_SECTION;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
|
|
|
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
|
|
|
|
shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
|
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
|
|
|
}
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
static int __remove_section(struct zone *zone, struct mem_section *ms,
|
2017-12-29 15:53:56 +08:00
|
|
|
unsigned long map_offset, struct vmem_altmap *altmap)
|
2008-04-28 17:12:01 +08:00
|
|
|
{
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn;
|
|
|
|
int scn_nr;
|
2008-04-28 17:12:01 +08:00
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
if (!valid_section(ms))
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = unregister_memory_section(ms);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2013-02-23 08:33:12 +08:00
|
|
|
scn_nr = __section_nr(ms);
|
2017-10-04 07:16:29 +08:00
|
|
|
start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
|
2013-02-23 08:33:12 +08:00
|
|
|
__remove_zone(zone, start_pfn);
|
|
|
|
|
2017-12-29 15:53:56 +08:00
|
|
|
sparse_remove_one_section(zone, ms, map_offset, altmap);
|
2008-04-28 17:12:01 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __remove_pages() - remove sections of pages from a zone
|
|
|
|
* @zone: zone from which pages need to be removed
|
|
|
|
* @phys_start_pfn: starting pageframe (must be aligned to start of a section)
|
|
|
|
* @nr_pages: number of pages to remove (must be multiple of section size)
|
2018-04-06 07:24:57 +08:00
|
|
|
* @altmap: alternative device page map or %NULL if default memmap is used
|
2008-04-28 17:12:01 +08:00
|
|
|
*
|
|
|
|
* Generic helper function to remove section mappings and sysfs entries
|
|
|
|
* for the section of the memory we are removing. Caller needs to make
|
|
|
|
* sure that pages are marked reserved and zones are adjust properly by
|
|
|
|
* calling offline_pages().
|
|
|
|
*/
|
|
|
|
int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
|
2017-12-29 15:53:55 +08:00
|
|
|
unsigned long nr_pages, struct vmem_altmap *altmap)
|
2008-04-28 17:12:01 +08:00
|
|
|
{
|
2013-04-30 06:08:20 +08:00
|
|
|
unsigned long i;
|
2016-01-16 08:56:22 +08:00
|
|
|
unsigned long map_offset = 0;
|
|
|
|
int sections_to_remove, ret = 0;
|
|
|
|
|
|
|
|
/* In the ZONE_DEVICE case device driver owns the memory region */
|
|
|
|
if (is_dev_zone(zone)) {
|
|
|
|
if (altmap)
|
|
|
|
map_offset = vmem_altmap_offset(altmap);
|
|
|
|
} else {
|
|
|
|
resource_size_t start, size;
|
|
|
|
|
|
|
|
start = phys_start_pfn << PAGE_SHIFT;
|
|
|
|
size = nr_pages * PAGE_SIZE;
|
|
|
|
|
|
|
|
ret = release_mem_region_adjustable(&iomem_resource, start,
|
|
|
|
size);
|
|
|
|
if (ret) {
|
|
|
|
resource_size_t endres = start + size - 1;
|
|
|
|
|
|
|
|
pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
|
|
|
|
&start, &endres, ret);
|
|
|
|
}
|
|
|
|
}
|
2008-04-28 17:12:01 +08:00
|
|
|
|
2016-03-16 05:57:51 +08:00
|
|
|
clear_zone_contiguous(zone);
|
|
|
|
|
2008-04-28 17:12:01 +08:00
|
|
|
/*
|
|
|
|
* We can only remove entire sections
|
|
|
|
*/
|
|
|
|
BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
|
|
|
|
BUG_ON(nr_pages % PAGES_PER_SECTION);
|
|
|
|
|
|
|
|
sections_to_remove = nr_pages / PAGES_PER_SECTION;
|
|
|
|
for (i = 0; i < sections_to_remove; i++) {
|
|
|
|
unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
|
2016-01-16 08:56:22 +08:00
|
|
|
|
2018-11-03 06:48:46 +08:00
|
|
|
cond_resched();
|
2017-12-29 15:53:56 +08:00
|
|
|
ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
|
|
|
|
altmap);
|
2016-01-16 08:56:22 +08:00
|
|
|
map_offset = 0;
|
2008-04-28 17:12:01 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
2016-03-16 05:57:51 +08:00
|
|
|
|
|
|
|
set_zone_contiguous(zone);
|
|
|
|
|
2008-04-28 17:12:01 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2013-04-30 06:08:22 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|
2008-04-28 17:12:01 +08:00
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
int set_online_page_callback(online_page_callback_t callback)
|
|
|
|
{
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
get_online_mems();
|
|
|
|
mutex_lock(&online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
if (online_page_callback == generic_online_page) {
|
|
|
|
online_page_callback = callback;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mutex_unlock(&online_page_callback_lock);
|
|
|
|
put_online_mems();
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(set_online_page_callback);
|
|
|
|
|
|
|
|
int restore_online_page_callback(online_page_callback_t callback)
|
|
|
|
{
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
get_online_mems();
|
|
|
|
mutex_lock(&online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
if (online_page_callback == callback) {
|
|
|
|
online_page_callback = generic_online_page;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mutex_unlock(&online_page_callback_lock);
|
|
|
|
put_online_mems();
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(restore_online_page_callback);
|
|
|
|
|
|
|
|
void __online_page_set_limits(struct page *page)
|
2008-04-28 17:12:03 +08:00
|
|
|
{
|
2011-07-26 08:12:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__online_page_set_limits);
|
|
|
|
|
|
|
|
void __online_page_increment_counters(struct page *page)
|
|
|
|
{
|
2013-07-04 06:03:21 +08:00
|
|
|
adjust_managed_page_count(page, 1);
|
2011-07-26 08:12:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__online_page_increment_counters);
|
2008-04-28 17:12:03 +08:00
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
void __online_page_free(struct page *page)
|
|
|
|
{
|
2013-07-04 06:03:21 +08:00
|
|
|
__free_reserved_page(page);
|
2008-04-28 17:12:03 +08:00
|
|
|
}
|
2011-07-26 08:12:05 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__online_page_free);
|
|
|
|
|
|
|
|
static void generic_online_page(struct page *page)
|
|
|
|
{
|
|
|
|
__online_page_set_limits(page);
|
|
|
|
__online_page_increment_counters(page);
|
|
|
|
__online_page_free(page);
|
|
|
|
}
|
2008-04-28 17:12:03 +08:00
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
|
|
|
|
void *arg)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
unsigned long i;
|
2007-10-16 16:26:10 +08:00
|
|
|
unsigned long onlined_pages = *(unsigned long *)arg;
|
|
|
|
struct page *page;
|
2017-07-07 06:37:56 +08:00
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
if (PageReserved(pfn_to_page(start_pfn)))
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
page = pfn_to_page(start_pfn + i);
|
2011-07-26 08:12:05 +08:00
|
|
|
(*online_page_callback)(page);
|
2007-10-16 16:26:10 +08:00
|
|
|
onlined_pages++;
|
|
|
|
}
|
2017-07-07 06:37:56 +08:00
|
|
|
|
|
|
|
online_mem_sections(start_pfn, start_pfn + nr_pages);
|
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
*(unsigned long *)arg = onlined_pages;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/* check which state of node_states will be changed when online memory */
|
|
|
|
static void node_states_check_changes_online(unsigned long nr_pages,
|
|
|
|
struct zone *zone, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
int nid = zone_to_nid(zone);
|
|
|
|
|
2018-10-27 06:07:34 +08:00
|
|
|
arg->status_change_nid = -1;
|
|
|
|
arg->status_change_nid_normal = -1;
|
|
|
|
arg->status_change_nid_high = -1;
|
2012-12-12 08:01:03 +08:00
|
|
|
|
2018-10-27 06:07:34 +08:00
|
|
|
if (!node_state(nid, N_MEMORY))
|
|
|
|
arg->status_change_nid = nid;
|
|
|
|
if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
|
2012-12-12 08:01:03 +08:00
|
|
|
arg->status_change_nid_normal = nid;
|
2012-12-13 05:51:49 +08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
2018-10-27 06:07:34 +08:00
|
|
|
if (zone_idx(zone) <= N_HIGH_MEMORY && !node_state(nid, N_HIGH_MEMORY))
|
2012-12-13 05:51:49 +08:00
|
|
|
arg->status_change_nid_high = nid;
|
|
|
|
#endif
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void node_states_set_node(int node, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
if (arg->status_change_nid_normal >= 0)
|
|
|
|
node_set_state(node, N_NORMAL_MEMORY);
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
if (arg->status_change_nid_high >= 0)
|
|
|
|
node_set_state(node, N_HIGH_MEMORY);
|
|
|
|
|
2018-10-27 06:07:25 +08:00
|
|
|
if (arg->status_change_nid >= 0)
|
|
|
|
node_set_state(node, N_MEMORY);
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
unsigned long old_end_pfn = zone_end_pfn(zone);
|
|
|
|
|
|
|
|
if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
|
|
|
|
zone->zone_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
|
|
|
|
|
|
|
|
if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
|
|
|
|
pgdat->node_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
|
|
|
|
}
|
|
|
|
|
2017-12-29 15:53:57 +08:00
|
|
|
void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages, struct vmem_altmap *altmap)
|
2017-07-07 06:38:11 +08:00
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
int nid = pgdat->node_id;
|
|
|
|
unsigned long flags;
|
2016-07-27 06:22:23 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
clear_zone_contiguous(zone);
|
|
|
|
|
|
|
|
/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
|
|
|
|
pgdat_resize_lock(pgdat, &flags);
|
|
|
|
zone_span_writelock(zone);
|
2018-12-28 16:37:10 +08:00
|
|
|
if (zone_is_empty(zone))
|
|
|
|
init_currently_empty_zone(zone, start_pfn, nr_pages);
|
2017-07-07 06:38:11 +08:00
|
|
|
resize_zone_range(zone, start_pfn, nr_pages);
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
resize_pgdat_range(pgdat, start_pfn, nr_pages);
|
|
|
|
pgdat_resize_unlock(pgdat, &flags);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* TODO now we have a visible range of pages which are not associated
|
|
|
|
* with their zone properly. Not nice but set_pfnblock_flags_mask
|
|
|
|
* expects the zone spans the pfn range. All the pages in the range
|
|
|
|
* are reserved so nobody should be touching them so we should be safe
|
|
|
|
*/
|
2017-12-29 15:53:57 +08:00
|
|
|
memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
|
|
|
|
MEMMAP_HOTPLUG, altmap);
|
2017-07-07 06:38:11 +08:00
|
|
|
|
|
|
|
set_zone_contiguous(zone);
|
|
|
|
}
|
|
|
|
|
2017-07-07 06:38:18 +08:00
|
|
|
/*
|
|
|
|
* Returns a default kernel memory zone for the given pfn range.
|
|
|
|
* If no kernel zone covers this pfn range it will automatically go
|
|
|
|
* to the ZONE_NORMAL.
|
|
|
|
*/
|
2017-09-07 07:19:40 +08:00
|
|
|
static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn,
|
2017-07-07 06:38:18 +08:00
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = NODE_DATA(nid);
|
|
|
|
int zid;
|
|
|
|
|
|
|
|
for (zid = 0; zid <= ZONE_NORMAL; zid++) {
|
|
|
|
struct zone *zone = &pgdat->node_zones[zid];
|
|
|
|
|
|
|
|
if (zone_intersects(zone, start_pfn, nr_pages))
|
|
|
|
return zone;
|
|
|
|
}
|
|
|
|
|
|
|
|
return &pgdat->node_zones[ZONE_NORMAL];
|
|
|
|
}
|
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
|
|
|
|
unsigned long nr_pages)
|
2017-09-07 07:19:37 +08:00
|
|
|
{
|
2017-09-07 07:19:40 +08:00
|
|
|
struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn,
|
|
|
|
nr_pages);
|
|
|
|
struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
|
|
|
|
bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
|
|
|
|
bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);
|
2017-09-07 07:19:37 +08:00
|
|
|
|
|
|
|
/*
|
2017-09-07 07:19:40 +08:00
|
|
|
* We inherit the existing zone in a simple case where zones do not
|
|
|
|
* overlap in the given range
|
2017-09-07 07:19:37 +08:00
|
|
|
*/
|
2017-09-07 07:19:40 +08:00
|
|
|
if (in_kernel ^ in_movable)
|
|
|
|
return (in_kernel) ? kernel_zone : movable_zone;
|
2017-07-11 06:48:37 +08:00
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
/*
|
|
|
|
* If the range doesn't belong to any zone or two zones overlap in the
|
|
|
|
* given range then we use movable zone only if movable_node is
|
|
|
|
* enabled because we always online to a kernel zone by default.
|
|
|
|
*/
|
|
|
|
return movable_node_enabled ? movable_zone : kernel_zone;
|
2017-07-11 06:48:37 +08:00
|
|
|
}
|
|
|
|
|
2017-09-07 07:19:37 +08:00
|
|
|
struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
|
|
|
|
unsigned long nr_pages)
|
2017-07-07 06:38:11 +08:00
|
|
|
{
|
2017-09-07 07:19:40 +08:00
|
|
|
if (online_type == MMOP_ONLINE_KERNEL)
|
|
|
|
return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
|
2017-07-07 06:38:11 +08:00
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
if (online_type == MMOP_ONLINE_MOVABLE)
|
|
|
|
return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
|
2016-07-27 06:22:23 +08:00
|
|
|
|
2017-09-07 07:19:40 +08:00
|
|
|
return default_zone_for_pfn(nid, start_pfn, nr_pages);
|
2017-09-07 07:19:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Associates the given pfn range with the given node and the zone appropriate
|
|
|
|
* for the given online type.
|
|
|
|
*/
|
|
|
|
static struct zone * __meminit move_pfn_range(int online_type, int nid,
|
|
|
|
unsigned long start_pfn, unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct zone *zone;
|
|
|
|
|
|
|
|
zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
|
2017-12-29 15:53:57 +08:00
|
|
|
move_pfn_range_to_zone(zone, start_pfn, nr_pages, NULL);
|
2017-07-07 06:38:11 +08:00
|
|
|
return zone;
|
2016-07-27 06:22:23 +08:00
|
|
|
}
|
2007-10-16 16:26:10 +08:00
|
|
|
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
|
2007-10-16 16:26:10 +08:00
|
|
|
{
|
2013-07-04 06:02:10 +08:00
|
|
|
unsigned long flags;
|
2005-10-30 09:16:54 +08:00
|
|
|
unsigned long onlined_pages = 0;
|
|
|
|
struct zone *zone;
|
2006-06-23 17:03:11 +08:00
|
|
|
int need_zonelists_rebuild = 0;
|
2007-10-22 07:41:36 +08:00
|
|
|
int nid;
|
|
|
|
int ret;
|
|
|
|
struct memory_notify arg;
|
2018-04-06 07:23:00 +08:00
|
|
|
struct memory_block *mem;
|
|
|
|
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
mem_hotplug_begin();
|
|
|
|
|
2018-04-06 07:23:00 +08:00
|
|
|
/*
|
|
|
|
* We can't use pfn_to_nid() because nid might be stored in struct page
|
|
|
|
* which is not yet initialized. Instead, we find nid from memory block.
|
|
|
|
*/
|
|
|
|
mem = find_memory_block(__pfn_to_section(pfn));
|
|
|
|
nid = mem->nid;
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2017-07-07 06:38:11 +08:00
|
|
|
/* associate pfn range with the zone */
|
|
|
|
zone = move_pfn_range(online_type, nid, pfn, nr_pages);
|
|
|
|
|
2007-10-22 07:41:36 +08:00
|
|
|
arg.start_pfn = pfn;
|
|
|
|
arg.nr_pages = nr_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_check_changes_online(nr_pages, zone, &arg);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
ret = memory_notify(MEM_GOING_ONLINE, &arg);
|
|
|
|
ret = notifier_to_errno(ret);
|
2016-03-18 05:19:35 +08:00
|
|
|
if (ret)
|
|
|
|
goto failed_addition;
|
|
|
|
|
2006-06-23 17:03:11 +08:00
|
|
|
/*
|
|
|
|
* If this zone is not populated, then it is not in zonelist.
|
|
|
|
* This means the page allocator ignores this zone.
|
|
|
|
* So, zonelist must be updated after online.
|
|
|
|
*/
|
2012-12-12 08:01:01 +08:00
|
|
|
if (!populated_zone(zone)) {
|
2006-06-23 17:03:11 +08:00
|
|
|
need_zonelists_rebuild = 1;
|
2017-09-07 07:20:24 +08:00
|
|
|
setup_zone_pageset(zone);
|
2012-12-12 08:01:01 +08:00
|
|
|
}
|
2006-06-23 17:03:11 +08:00
|
|
|
|
2009-09-23 07:45:46 +08:00
|
|
|
ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
|
2007-10-16 16:26:10 +08:00
|
|
|
online_pages_range);
|
2008-05-15 07:05:50 +08:00
|
|
|
if (ret) {
|
2012-12-12 08:01:01 +08:00
|
|
|
if (need_zonelists_rebuild)
|
|
|
|
zone_pcp_reset(zone);
|
2016-03-18 05:19:35 +08:00
|
|
|
goto failed_addition;
|
2008-05-15 07:05:50 +08:00
|
|
|
}
|
|
|
|
|
2005-10-30 09:16:54 +08:00
|
|
|
zone->present_pages += onlined_pages;
|
2013-07-04 06:02:10 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
2006-03-10 09:33:51 +08:00
|
|
|
zone->zone_pgdat->node_present_pages += onlined_pages;
|
2013-07-04 06:02:10 +08:00
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
|
|
|
|
2012-08-01 07:43:30 +08:00
|
|
|
if (onlined_pages) {
|
2016-03-18 05:18:12 +08:00
|
|
|
node_states_set_node(nid, &arg);
|
2012-08-01 07:43:30 +08:00
|
|
|
if (need_zonelists_rebuild)
|
2017-09-07 07:20:24 +08:00
|
|
|
build_all_zonelists(NULL);
|
2012-08-01 07:43:30 +08:00
|
|
|
else
|
|
|
|
zone_pcp_update(zone);
|
|
|
|
}
|
2005-10-30 09:16:54 +08:00
|
|
|
|
2011-05-25 08:11:32 +08:00
|
|
|
init_per_zone_wmark_min();
|
|
|
|
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
if (onlined_pages) {
|
2016-03-18 05:18:12 +08:00
|
|
|
kswapd_run(nid);
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
kcompactd_run(nid);
|
|
|
|
}
|
2005-10-30 09:16:56 +08:00
|
|
|
|
2010-05-25 05:32:51 +08:00
|
|
|
vm_total_pages = nr_free_pagecache_pages();
|
2008-07-24 12:28:18 +08:00
|
|
|
|
2006-09-29 17:01:25 +08:00
|
|
|
writeback_set_ratelimit();
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
if (onlined_pages)
|
|
|
|
memory_notify(MEM_ONLINE, &arg);
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
mem_hotplug_done();
|
2015-04-15 06:45:11 +08:00
|
|
|
return 0;
|
2016-03-18 05:19:35 +08:00
|
|
|
|
|
|
|
failed_addition:
|
|
|
|
pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
|
|
|
|
(unsigned long long) pfn << PAGE_SHIFT,
|
|
|
|
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
|
|
|
|
memory_notify(MEM_CANCEL_ONLINE, &arg);
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
mem_hotplug_done();
|
2016-03-18 05:19:35 +08:00
|
|
|
return ret;
|
2005-10-30 09:16:54 +08:00
|
|
|
}
|
2006-10-01 14:27:08 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
|
2006-06-27 17:53:30 +08:00
|
|
|
|
2014-11-14 07:19:41 +08:00
|
|
|
static void reset_node_present_pages(pg_data_t *pgdat)
|
|
|
|
{
|
|
|
|
struct zone *z;
|
|
|
|
|
|
|
|
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
|
|
|
|
z->present_pages = 0;
|
|
|
|
|
|
|
|
pgdat->node_present_pages = 0;
|
|
|
|
}
|
|
|
|
|
2009-11-18 06:06:18 +08:00
|
|
|
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
|
|
|
|
static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
{
|
|
|
|
struct pglist_data *pgdat;
|
2014-06-05 07:07:51 +08:00
|
|
|
unsigned long start_pfn = PFN_DOWN(start);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
pgdat = NODE_DATA(nid);
|
|
|
|
if (!pgdat) {
|
|
|
|
pgdat = arch_alloc_nodedata(nid);
|
|
|
|
if (!pgdat)
|
|
|
|
return NULL;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
arch_refresh_nodedata(nid, pgdat);
|
mm/memory hotplug: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:
BUG: unable to handle kernel paging request at 0000000000025f60
IP: next_online_pgdat+0x1/0x50
PGD 0
Oops: 0000 [#1] SMP
ACPI: Device does not support D3cold
Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
Workqueue: events vmstat_update
task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
RIP: 0010: next_online_pgdat+0x1/0x50
RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286
RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
refresh_cpu_vm_stats+0xd0/0x140
vmstat_update+0x11/0x50
process_one_work+0x194/0x3d0
worker_thread+0x12b/0x410
kthread+0xc6/0xd0
ret_from_fork+0x7c/0xb0
The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.
process A: offline node XX:
vmstat_updat()
refresh_cpu_vm_stats()
for_each_populated_zone()
find online node XX
cond_resched()
offline cpu and memory, then try_offline_node()
node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
zone = next_zone(zone)
pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now
next_online_pgdat(pgdat)
next_online_node(pgdat->node_id); // NULL pointer access
So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-26 06:55:20 +08:00
|
|
|
} else {
|
mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
zone it was woken for. kswapd tracks what zone it was recently woken
for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
this zone will be 0.
However, the decision on whether to sleep is made on
kswapd_classzone_idx which is 0 without a recent wakeup request and that
classzone does not account for lowmem reserves. This allows kswapd to
sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA
request even if a stream of allocations cannot use that zone. While
kswapd may be woken again shortly in the near future there are two
consequences -- the pgdat bits that control congestion are cleared
prematurely and direct reclaim is more likely as kswapd slept
prematurely.
This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
invalid index) when there has been no recent wakeups. If there are no
wakeups, it'll decide whether to sleep based on the highest possible
zone available (MAX_NR_ZONES - 1). It then becomes critical that the
"pgdat balanced" decisions during reclaim and when deciding to sleep are
the same. If there is a mismatch, kswapd can stay awake continually
trying to balance tiny zones.
simoop was used to evaluate it again. Two of the preparation patches
regressed the workload so they are included as the second set of
results. Otherwise this patch looks artifically excellent
4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
vanilla clear-v2 keepawake-v2
Amean p50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%)
Amean p95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%)
Amean p99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%)
Amean p50-Write 1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%)
Amean p95-Write 412901.57 ( 0.00%) 34874.98 ( 91.55%) 1362.62 ( 99.67%)
Amean p99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%)
Amean p50-Allocation 78714.31 ( 0.00%) 84246.26 ( -7.03%) 74729.74 ( 5.06%)
Amean p95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%)
Amean p99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)
With this patch on top, write and allocation latencies are massively
improved. The read latencies are slightly impaired but it's worth
noting that this is mostly due to the IO scheduler and not directly
related to reclaim. The vmstats are a bit of a mix but the relevant
ones are as follows;
4.10.0-rc7 4.10.0-rc7 4.10.0-rc7
mmots-20170209 clear-v1r25keepawake-v1r25
Swap Ins 0 0 0
Swap Outs 0 608 0
Direct pages scanned 6910672 3132699 6357298
Kswapd pages scanned 57036946 82488665 56986286
Kswapd pages reclaimed 55993488 63474329 55939113
Direct pages reclaimed 6905990 2964843 6352115
Kswapd efficiency 98% 76% 98%
Kswapd velocity 12494.375 17597.507 12488.065
Direct efficiency 99% 94% 99%
Direct velocity 1513.835 668.306 1393.148
Page writes by reclaim 0.000 4410243.000 0.000
Page writes file 0 4409635 0
Page writes anon 0 608 0
Page reclaim immediate 1036792 14175203 1042571
4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
vanilla clear-v2 keepawake-v2
Swap Ins 0 12 0
Swap Outs 0 838 0
Direct pages scanned 6579706 3237270 6256811
Kswapd pages scanned 61853702 79961486 54837791
Kswapd pages reclaimed 60768764 60755788 53849586
Direct pages reclaimed 6579055 2987453 6256151
Kswapd efficiency 98% 75% 98%
Page writes by reclaim 0.000 4389496.000 0.000
Page writes file 0 4388658 0
Page writes anon 0 838 0
Page reclaim immediate 1073573 14473009 982507
Swap-outs are equivalent to baseline.
Direct reclaim is reduced but not eliminated. It's worth noting that
there are two periods of direct reclaim for this workload. The first is
when it switches from preparing the files for the actual test itself.
It's a lot of file IO followed by a lot of allocs that reclaims heavily
for a brief window. While direct reclaim is lower with clear-v2, it is
due to kswapd scanning aggressively and trying to reclaim the world
which is not the right thing to do. With the patches applied, there is
still direct reclaim but the phase change from "creating work files" to
starting multiple threads that allocate a lot of anonymous memory faster
than kswapd can reclaim.
Scanning/reclaim efficiency is restored by this patch.
Page writes from reclaim context are back at 0 which is ideal.
Pages immediately reclaimed after IO completes is slightly improved but
it is expected this will vary slightly.
On UMA, there is almost no change so this is not expected to be a
universal win.
[mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04 05:53:45 +08:00
|
|
|
/*
|
|
|
|
* Reset the nr_zones, order and classzone_idx before reuse.
|
|
|
|
* Note that kswapd will init kswapd_classzone_idx properly
|
|
|
|
* when it starts in the near future.
|
|
|
|
*/
|
mm/memory hotplug: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:
BUG: unable to handle kernel paging request at 0000000000025f60
IP: next_online_pgdat+0x1/0x50
PGD 0
Oops: 0000 [#1] SMP
ACPI: Device does not support D3cold
Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
Workqueue: events vmstat_update
task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
RIP: 0010: next_online_pgdat+0x1/0x50
RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286
RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
refresh_cpu_vm_stats+0xd0/0x140
vmstat_update+0x11/0x50
process_one_work+0x194/0x3d0
worker_thread+0x12b/0x410
kthread+0xc6/0xd0
ret_from_fork+0x7c/0xb0
The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.
process A: offline node XX:
vmstat_updat()
refresh_cpu_vm_stats()
for_each_populated_zone()
find online node XX
cond_resched()
offline cpu and memory, then try_offline_node()
node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
zone = next_zone(zone)
pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now
next_online_pgdat(pgdat)
next_online_node(pgdat->node_id); // NULL pointer access
So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-26 06:55:20 +08:00
|
|
|
pgdat->nr_zones = 0;
|
2016-07-29 06:45:49 +08:00
|
|
|
pgdat->kswapd_order = 0;
|
|
|
|
pgdat->kswapd_classzone_idx = 0;
|
2013-02-23 08:33:18 +08:00
|
|
|
}
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
|
|
|
/* we can use NODE_DATA(nid) from here */
|
|
|
|
|
mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.
free_area_init_core() performs the following actions:
1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone
When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.
Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages. For the same reason, action #3 results
always in manages_pages being 0.
Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
This patch does two things:
First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:
1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data
These two functions are: pgdat_init_internals() and zone_init_internals().
The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():
Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.
Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.
Actually, we re-set these values to 0 later on with the calls to:
reset_node_managed_pages()
reset_node_present_pages()
The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.
[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 12:53:43 +08:00
|
|
|
pgdat->node_id = nid;
|
|
|
|
pgdat->node_start_pfn = start_pfn;
|
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
/* init node's zones as empty zones, we don't have any present pages.*/
|
mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.
free_area_init_core() performs the following actions:
1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone
When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.
Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages. For the same reason, action #3 results
always in manages_pages being 0.
Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
This patch does two things:
First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:
1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data
These two functions are: pgdat_init_internals() and zone_init_internals().
The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():
Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.
Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.
Actually, we re-set these values to 0 later on with the calls to:
reset_node_managed_pages()
reset_node_present_pages()
The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.
[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 12:53:43 +08:00
|
|
|
free_area_init_core_hotplug(nid);
|
mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
The following oops occurs after a pgdat is hotadded:
Unable to handle kernel paging request for data at address 0x00c30001
Faulting instruction address: 0xc00000000022f8f4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
task: c000000000ef3080 task.stack: c000000000f6c000
NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 84002028 XER: 20000000
CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
NIP refresh_cpu_vm_stats+0x1a4/0x2f0
LR refresh_cpu_vm_stats+0x1f8/0x2f0
Call Trace:
refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)
Add per_cpu_nodestats initialization to the hotplug codepath.
Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-12 06:33:12 +08:00
|
|
|
pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2011-06-16 06:08:38 +08:00
|
|
|
/*
|
|
|
|
* The node we allocated has no zone fallback lists. For avoiding
|
|
|
|
* to access not-initialized zonelist, build here.
|
|
|
|
*/
|
2017-09-07 07:20:24 +08:00
|
|
|
build_all_zonelists(pgdat);
|
2011-06-16 06:08:38 +08:00
|
|
|
|
2014-11-14 07:19:41 +08:00
|
|
|
/*
|
|
|
|
* When memory is hot-added, all the memory is in offline state. So
|
|
|
|
* clear all zones' present_pages because they will be updated in
|
|
|
|
* online_pages() and offline_pages().
|
|
|
|
*/
|
mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.
free_area_init_core() performs the following actions:
1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone
When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.
Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages. For the same reason, action #3 results
always in manages_pages being 0.
Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
This patch does two things:
First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:
1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data
These two functions are: pgdat_init_internals() and zone_init_internals().
The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():
Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.
Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.
Actually, we re-set these values to 0 later on with the calls to:
reset_node_managed_pages()
reset_node_present_pages()
The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.
[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 12:53:43 +08:00
|
|
|
reset_node_managed_pages(pgdat);
|
2014-11-14 07:19:41 +08:00
|
|
|
reset_node_present_pages(pgdat);
|
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
return pgdat;
|
|
|
|
}
|
|
|
|
|
2018-08-18 06:46:15 +08:00
|
|
|
static void rollback_node_hotadd(int nid)
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
{
|
2018-08-18 06:46:15 +08:00
|
|
|
pg_data_t *pgdat = NODE_DATA(nid);
|
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
arch_refresh_nodedata(nid, NULL);
|
mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
The following oops occurs after a pgdat is hotadded:
Unable to handle kernel paging request for data at address 0x00c30001
Faulting instruction address: 0xc00000000022f8f4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
task: c000000000ef3080 task.stack: c000000000f6c000
NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 84002028 XER: 20000000
CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
NIP refresh_cpu_vm_stats+0x1a4/0x2f0
LR refresh_cpu_vm_stats+0x1f8/0x2f0
Call Trace:
refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)
Add per_cpu_nodestats initialization to the hotplug codepath.
Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-12 06:33:12 +08:00
|
|
|
free_percpu(pgdat->per_cpu_nodestats);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
arch_free_nodedata(pgdat);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-06-27 17:53:35 +08:00
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
/**
|
|
|
|
* try_online_node - online a node if offlined
|
2018-04-06 07:24:57 +08:00
|
|
|
* @nid: the node ID
|
2018-08-18 06:46:15 +08:00
|
|
|
* @start: start addr of the node
|
|
|
|
* @set_node_online: Whether we want to online the node
|
2010-05-25 05:32:41 +08:00
|
|
|
* called by cpu_up() to online a node without onlined memory.
|
2018-08-18 06:46:15 +08:00
|
|
|
*
|
|
|
|
* Returns:
|
|
|
|
* 1 -> a new node has been allocated
|
|
|
|
* 0 -> the node is already online
|
|
|
|
* -ENOMEM -> the node could not be allocated
|
2010-05-25 05:32:41 +08:00
|
|
|
*/
|
2018-08-18 06:46:15 +08:00
|
|
|
static int __try_online_node(int nid, u64 start, bool set_node_online)
|
2010-05-25 05:32:41 +08:00
|
|
|
{
|
2018-08-18 06:46:15 +08:00
|
|
|
pg_data_t *pgdat;
|
|
|
|
int ret = 1;
|
2010-05-25 05:32:41 +08:00
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
if (node_online(nid))
|
|
|
|
return 0;
|
|
|
|
|
2018-08-18 06:46:15 +08:00
|
|
|
pgdat = hotadd_new_pgdat(nid, start);
|
2011-06-23 09:13:01 +08:00
|
|
|
if (!pgdat) {
|
2013-11-13 07:07:25 +08:00
|
|
|
pr_err("Cannot online node %d due to NULL pgdat\n", nid);
|
2010-05-25 05:32:41 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2018-08-18 06:46:15 +08:00
|
|
|
|
|
|
|
if (set_node_online) {
|
|
|
|
node_set_online(nid);
|
|
|
|
ret = register_one_node(nid);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
2010-05-25 05:32:41 +08:00
|
|
|
out:
|
2018-08-18 06:46:15 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Users of this function always want to online/register the node
|
|
|
|
*/
|
|
|
|
int try_online_node(int nid)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
mem_hotplug_begin();
|
|
|
|
ret = __try_online_node(nid, 0, true);
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2010-05-25 05:32:41 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
static int check_hotplug_memory_range(u64 start, u64 size)
|
|
|
|
{
|
mm/memory_hotplug: enforce block size aligned range check
Patch series "optimize memory hotplug", v3.
This patchset:
- Improves hotplug performance by eliminating a number of struct page
traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries were not
properly checked. And on x86 block size was not properly aligned with
end of memory
- Also, potentially improves boot performance by eliminating condition
from __init_single_page().
- Adds robustness by verifying that that struct pages are correctly
poisoned when flags are accessed.
The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:
booting in qemu with 960G of memory, time to initialize struct pages:
no-kvm:
TRY1 TRY2
BEFORE: 39.433668 39.39705
AFTER: 36.903781 36.989329
with-kvm:
BEFORE: 10.977447 11.103164
AFTER: 10.929072 10.751885
Hotplug 896G memory:
no-kvm:
TRY1 TRY2
BEFORE: 848.740000 846.910000
AFTER: 783.070000 786.560000
with-kvm:
TRY1 TRY2
BEFORE: 34.410000 33.57
AFTER: 29.810000 29.580000
This patch (of 6):
Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
Which: boots machine with 64G, and adds a device mem1 with 2G which can
be hotplugged later.
Also make sure that config has the following turned on:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor hotplug the memory (make sure config has (qemu)
device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
Call Trace:
memory_subsys_online+0x44/0xa0
device_online+0x51/0x80
store_mem_state+0x5e/0xe0
kernfs_fop_write+0xfa/0x170
__vfs_write+0x2e/0x150
vfs_write+0xa8/0x1a0
SyS_write+0x4d/0xb0
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x21/0x86
---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in: drivers/base/memory.c
static bool pages_correctly_reserved(unsigned long start_pfn)
205 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000
or
$ dmesg | grep "block size"
[ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
$ dmesg | grep last_pfn
[ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
0x1040000 == 65G, and so is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.
With this fix, running the above sequence yield to the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 07:22:39 +08:00
|
|
|
unsigned long block_sz = memory_block_size_bytes();
|
|
|
|
u64 block_nr_pages = block_sz >> PAGE_SHIFT;
|
2013-09-12 05:21:49 +08:00
|
|
|
u64 nr_pages = size >> PAGE_SHIFT;
|
mm/memory_hotplug: enforce block size aligned range check
Patch series "optimize memory hotplug", v3.
This patchset:
- Improves hotplug performance by eliminating a number of struct page
traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries were not
properly checked. And on x86 block size was not properly aligned with
end of memory
- Also, potentially improves boot performance by eliminating condition
from __init_single_page().
- Adds robustness by verifying that that struct pages are correctly
poisoned when flags are accessed.
The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:
booting in qemu with 960G of memory, time to initialize struct pages:
no-kvm:
TRY1 TRY2
BEFORE: 39.433668 39.39705
AFTER: 36.903781 36.989329
with-kvm:
BEFORE: 10.977447 11.103164
AFTER: 10.929072 10.751885
Hotplug 896G memory:
no-kvm:
TRY1 TRY2
BEFORE: 848.740000 846.910000
AFTER: 783.070000 786.560000
with-kvm:
TRY1 TRY2
BEFORE: 34.410000 33.57
AFTER: 29.810000 29.580000
This patch (of 6):
Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
Which: boots machine with 64G, and adds a device mem1 with 2G which can
be hotplugged later.
Also make sure that config has the following turned on:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor hotplug the memory (make sure config has (qemu)
device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
Call Trace:
memory_subsys_online+0x44/0xa0
device_online+0x51/0x80
store_mem_state+0x5e/0xe0
kernfs_fop_write+0xfa/0x170
__vfs_write+0x2e/0x150
vfs_write+0xa8/0x1a0
SyS_write+0x4d/0xb0
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x21/0x86
---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in: drivers/base/memory.c
static bool pages_correctly_reserved(unsigned long start_pfn)
205 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000
or
$ dmesg | grep "block size"
[ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
$ dmesg | grep last_pfn
[ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
0x1040000 == 65G, and so is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.
With this fix, running the above sequence yield to the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 07:22:39 +08:00
|
|
|
u64 start_pfn = PFN_DOWN(start);
|
2013-09-12 05:21:49 +08:00
|
|
|
|
mm/memory_hotplug: enforce block size aligned range check
Patch series "optimize memory hotplug", v3.
This patchset:
- Improves hotplug performance by eliminating a number of struct page
traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries were not
properly checked. And on x86 block size was not properly aligned with
end of memory
- Also, potentially improves boot performance by eliminating condition
from __init_single_page().
- Adds robustness by verifying that that struct pages are correctly
poisoned when flags are accessed.
The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:
booting in qemu with 960G of memory, time to initialize struct pages:
no-kvm:
TRY1 TRY2
BEFORE: 39.433668 39.39705
AFTER: 36.903781 36.989329
with-kvm:
BEFORE: 10.977447 11.103164
AFTER: 10.929072 10.751885
Hotplug 896G memory:
no-kvm:
TRY1 TRY2
BEFORE: 848.740000 846.910000
AFTER: 783.070000 786.560000
with-kvm:
TRY1 TRY2
BEFORE: 34.410000 33.57
AFTER: 29.810000 29.580000
This patch (of 6):
Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
Which: boots machine with 64G, and adds a device mem1 with 2G which can
be hotplugged later.
Also make sure that config has the following turned on:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor hotplug the memory (make sure config has (qemu)
device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
Call Trace:
memory_subsys_online+0x44/0xa0
device_online+0x51/0x80
store_mem_state+0x5e/0xe0
kernfs_fop_write+0xfa/0x170
__vfs_write+0x2e/0x150
vfs_write+0xa8/0x1a0
SyS_write+0x4d/0xb0
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x21/0x86
---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in: drivers/base/memory.c
static bool pages_correctly_reserved(unsigned long start_pfn)
205 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000
or
$ dmesg | grep "block size"
[ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
$ dmesg | grep last_pfn
[ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
0x1040000 == 65G, and so is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.
With this fix, running the above sequence yield to the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 07:22:39 +08:00
|
|
|
/* memory range must be block size aligned */
|
|
|
|
if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
|
|
|
|
!IS_ALIGNED(nr_pages, block_nr_pages)) {
|
|
|
|
pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
|
|
|
|
block_sz, start, size);
|
2013-09-12 05:21:49 +08:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
static int online_memory_block(struct memory_block *mem, void *arg)
|
|
|
|
{
|
2017-02-25 07:00:02 +08:00
|
|
|
return device_online(&mem->dev);
|
2016-03-16 05:56:48 +08:00
|
|
|
}
|
|
|
|
|
2018-10-31 06:10:24 +08:00
|
|
|
/*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations (triggered e.g. by sysfs).
|
|
|
|
*
|
|
|
|
* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
|
|
|
|
*/
|
2018-12-28 16:35:36 +08:00
|
|
|
int __ref add_memory_resource(int nid, struct resource *res)
|
2006-06-27 17:53:30 +08:00
|
|
|
{
|
2015-06-25 23:35:49 +08:00
|
|
|
u64 start, size;
|
2018-08-18 06:46:15 +08:00
|
|
|
bool new_node = false;
|
2006-06-27 17:53:30 +08:00
|
|
|
int ret;
|
|
|
|
|
2015-06-25 23:35:49 +08:00
|
|
|
start = res->start;
|
|
|
|
size = resource_size(res);
|
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
ret = check_hotplug_memory_range(start, size);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2014-01-24 07:53:26 +08:00
|
|
|
|
2015-09-05 06:42:32 +08:00
|
|
|
/*
|
|
|
|
* Add new range to memblock so that when hotadd_new_pgdat() is called
|
|
|
|
* to allocate new pgdat, get_pfn_range_for_nid() will be able to find
|
|
|
|
* this new range and calculate total pages correctly. The range will
|
|
|
|
* be removed at hot-remove time.
|
|
|
|
*/
|
|
|
|
memblock_add_node(start, size, nid);
|
|
|
|
|
2018-08-18 06:46:15 +08:00
|
|
|
ret = __try_online_node(nid, start, false);
|
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
new_node = ret;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2006-06-27 17:53:30 +08:00
|
|
|
/* call arch's memory hotadd */
|
2017-12-29 15:53:53 +08:00
|
|
|
ret = arch_add_memory(nid, start, size, NULL, true);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
if (new_node) {
|
2018-08-18 06:46:18 +08:00
|
|
|
/* If sysfs file of new node can't be created, cpu on the node
|
2006-06-27 17:53:38 +08:00
|
|
|
* can't be hot-added. There is no rollback way now.
|
|
|
|
* So, check by BUG_ON() to catch it reluctantly..
|
2018-08-18 06:46:18 +08:00
|
|
|
* We online node here. We can't roll back from here.
|
2006-06-27 17:53:38 +08:00
|
|
|
*/
|
2018-08-18 06:46:18 +08:00
|
|
|
node_set_online(nid);
|
|
|
|
ret = __register_one_node(nid);
|
2006-06-27 17:53:38 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
2018-08-18 06:46:18 +08:00
|
|
|
/* link memory sections under this node.*/
|
2018-08-18 06:46:22 +08:00
|
|
|
ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1));
|
2018-08-18 06:46:18 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
|
2010-03-06 05:41:58 +08:00
|
|
|
/* create new memmap entry */
|
|
|
|
firmware_map_add_hotplug(start, start + size, "System RAM");
|
|
|
|
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
/* device_online() will take the lock when calling online_pages() */
|
|
|
|
mem_hotplug_done();
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
/* online pages if requested */
|
2018-12-28 16:35:36 +08:00
|
|
|
if (memhp_auto_online)
|
2016-03-16 05:56:48 +08:00
|
|
|
walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
|
|
|
|
NULL, online_memory_block);
|
|
|
|
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
return ret;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
error:
|
|
|
|
/* rollback pgdat allocation and others */
|
2018-08-18 06:46:15 +08:00
|
|
|
if (new_node)
|
|
|
|
rollback_node_hotadd(nid);
|
2015-09-05 06:42:32 +08:00
|
|
|
memblock_remove(start, size);
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2006-06-27 17:53:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2015-06-25 23:35:49 +08:00
|
|
|
|
2018-10-31 06:10:24 +08:00
|
|
|
/* requires device_hotplug_lock, see add_memory_resource() */
|
|
|
|
int __ref __add_memory(int nid, u64 start, u64 size)
|
2015-06-25 23:35:49 +08:00
|
|
|
{
|
|
|
|
struct resource *res;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
res = register_memory_resource(start, size);
|
2016-01-15 07:21:55 +08:00
|
|
|
if (IS_ERR(res))
|
|
|
|
return PTR_ERR(res);
|
2015-06-25 23:35:49 +08:00
|
|
|
|
2018-12-28 16:35:36 +08:00
|
|
|
ret = add_memory_resource(nid, res);
|
2015-06-25 23:35:49 +08:00
|
|
|
if (ret < 0)
|
|
|
|
release_memory_resource(res);
|
|
|
|
return ret;
|
|
|
|
}
|
2018-10-31 06:10:24 +08:00
|
|
|
|
|
|
|
int add_memory(int nid, u64 start, u64 size)
|
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
lock_device_hotplug();
|
|
|
|
rc = __add_memory(nid, start, size);
|
|
|
|
unlock_device_hotplug();
|
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
2006-06-27 17:53:30 +08:00
|
|
|
EXPORT_SYMBOL_GPL(add_memory);
|
2007-10-16 16:26:12 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2008-07-24 12:28:19 +08:00
|
|
|
/*
|
|
|
|
* A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
|
|
|
|
* set and the size of the free page is given by page_order(). Using this,
|
|
|
|
* the function determines if the pageblock contains only free pages.
|
|
|
|
* Due to buddy contraints, a free page at least the size of a pageblock will
|
|
|
|
* be located at the start of the pageblock
|
|
|
|
*/
|
|
|
|
static inline int pageblock_free(struct page *page)
|
|
|
|
{
|
|
|
|
return PageBuddy(page) && page_order(page) >= pageblock_order;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Return the start of the next active pageblock after a given page */
|
|
|
|
static struct page *next_active_pageblock(struct page *page)
|
|
|
|
{
|
|
|
|
/* Ensure the starting page is pageblock-aligned */
|
|
|
|
BUG_ON(page_to_pfn(page) & (pageblock_nr_pages - 1));
|
|
|
|
|
|
|
|
/* If the entire pageblock is free, move to the end of free page */
|
2010-09-10 07:38:01 +08:00
|
|
|
if (pageblock_free(page)) {
|
|
|
|
int order;
|
|
|
|
/* be careful. we don't have locks, page_order can be changed.*/
|
|
|
|
order = page_order(page);
|
|
|
|
if ((order < MAX_ORDER) && (order >= pageblock_order))
|
|
|
|
return page + (1 << order);
|
|
|
|
}
|
2008-07-24 12:28:19 +08:00
|
|
|
|
2010-09-10 07:38:01 +08:00
|
|
|
return page + pageblock_nr_pages;
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
2018-06-08 08:07:43 +08:00
|
|
|
static bool is_pageblock_removable_nolock(struct page *page)
|
|
|
|
{
|
|
|
|
struct zone *zone;
|
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have to be careful here because we are iterating over memory
|
|
|
|
* sections which are not zone aware so we might end up outside of
|
|
|
|
* the zone but still within the section.
|
|
|
|
* We have to take care about the node as well. If the node is offline
|
|
|
|
* its NODE_DATA will be NULL - see page_zone.
|
|
|
|
*/
|
|
|
|
if (!node_online(page_to_nid(page)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
zone = page_zone(page);
|
|
|
|
pfn = page_to_pfn(page);
|
|
|
|
if (!zone_spans_pfn(zone, pfn))
|
|
|
|
return false;
|
|
|
|
|
2018-12-28 16:33:56 +08:00
|
|
|
return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, SKIP_HWPOISON);
|
2018-06-08 08:07:43 +08:00
|
|
|
}
|
|
|
|
|
2008-07-24 12:28:19 +08:00
|
|
|
/* Checks if this range of memory is likely to be hot-removable. */
|
2016-05-20 08:11:26 +08:00
|
|
|
bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
|
2008-07-24 12:28:19 +08:00
|
|
|
{
|
|
|
|
struct page *page = pfn_to_page(start_pfn);
|
|
|
|
struct page *end_page = page + nr_pages;
|
|
|
|
|
|
|
|
/* Check the starting page of each pageblock within the range */
|
|
|
|
for (; page < end_page; page = next_active_pageblock(page)) {
|
2010-10-27 05:21:30 +08:00
|
|
|
if (!is_pageblock_removable_nolock(page))
|
2016-05-20 08:11:26 +08:00
|
|
|
return false;
|
2010-10-27 05:21:30 +08:00
|
|
|
cond_resched();
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* All pageblocks in the memory block are likely to be hot-removable */
|
2016-05-20 08:11:26 +08:00
|
|
|
return true;
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/*
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
* Confirm all pages in a range [start, end) belong to the same zone.
|
2017-02-04 05:13:23 +08:00
|
|
|
* When true, return its valid [start, end).
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2017-02-04 05:13:23 +08:00
|
|
|
int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
|
|
|
|
unsigned long *valid_start, unsigned long *valid_end)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
2015-12-30 06:54:25 +08:00
|
|
|
unsigned long pfn, sec_end_pfn;
|
2017-02-04 05:13:23 +08:00
|
|
|
unsigned long start, end;
|
2007-10-16 16:26:12 +08:00
|
|
|
struct zone *zone = NULL;
|
|
|
|
struct page *page;
|
|
|
|
int i;
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
|
2007-10-16 16:26:12 +08:00
|
|
|
pfn < end_pfn;
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
|
2015-12-30 06:54:25 +08:00
|
|
|
/* Make sure the memory section is present first */
|
|
|
|
if (!present_section_nr(pfn_to_section_nr(pfn)))
|
2007-10-16 16:26:12 +08:00
|
|
|
continue;
|
2015-12-30 06:54:25 +08:00
|
|
|
for (; pfn < sec_end_pfn && pfn < end_pfn;
|
|
|
|
pfn += MAX_ORDER_NR_PAGES) {
|
|
|
|
i = 0;
|
|
|
|
/* This is just a CONFIG_HOLES_IN_ZONE check.*/
|
|
|
|
while ((i < MAX_ORDER_NR_PAGES) &&
|
|
|
|
!pfn_valid_within(pfn + i))
|
|
|
|
i++;
|
2017-02-25 06:59:30 +08:00
|
|
|
if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
|
2015-12-30 06:54:25 +08:00
|
|
|
continue;
|
|
|
|
page = pfn_to_page(pfn + i);
|
|
|
|
if (zone && page_zone(page) != zone)
|
|
|
|
return 0;
|
2017-02-04 05:13:23 +08:00
|
|
|
if (!zone)
|
|
|
|
start = pfn + i;
|
2015-12-30 06:54:25 +08:00
|
|
|
zone = page_zone(page);
|
2017-02-04 05:13:23 +08:00
|
|
|
end = pfn + MAX_ORDER_NR_PAGES;
|
2015-12-30 06:54:25 +08:00
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
|
2017-02-04 05:13:23 +08:00
|
|
|
if (zone) {
|
|
|
|
*valid_start = start;
|
2017-02-25 06:59:30 +08:00
|
|
|
*valid_end = min(end, end_pfn);
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
return 1;
|
2017-02-04 05:13:23 +08:00
|
|
|
} else {
|
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.
So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'
This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-04 05:13:20 +08:00
|
|
|
return 0;
|
2017-02-04 05:13:23 +08:00
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-02-25 06:57:39 +08:00
|
|
|
* Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
|
|
|
|
* non-lru movable pages and hugepages). We scan pfn because it's much
|
|
|
|
* easier than scanning over linked list. This function returns the pfn
|
|
|
|
* of the first found movable page if it's found, otherwise 0.
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2013-09-12 05:22:09 +08:00
|
|
|
static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
|
|
|
unsigned long pfn;
|
|
|
|
struct page *page;
|
|
|
|
for (pfn = start; pfn < end; pfn++) {
|
|
|
|
if (pfn_valid(pfn)) {
|
|
|
|
page = pfn_to_page(pfn);
|
|
|
|
if (PageLRU(page))
|
|
|
|
return pfn;
|
2017-02-25 06:57:39 +08:00
|
|
|
if (__PageMovable(page))
|
|
|
|
return pfn;
|
2013-09-12 05:22:09 +08:00
|
|
|
if (PageHuge(page)) {
|
2018-09-05 06:45:59 +08:00
|
|
|
if (hugepage_migration_supported(page_hstate(page)) &&
|
|
|
|
page_huge_active(page))
|
2013-09-12 05:22:09 +08:00
|
|
|
return pfn;
|
|
|
|
else
|
|
|
|
pfn = round_up(pfn + 1,
|
|
|
|
1 << compound_order(page)) - 1;
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-04-11 07:30:03 +08:00
|
|
|
static struct page *new_node_page(struct page *page, unsigned long private)
|
2016-07-29 06:48:53 +08:00
|
|
|
{
|
|
|
|
int nid = page_to_nid(page);
|
2016-09-29 06:22:38 +08:00
|
|
|
nodemask_t nmask = node_states[N_MEMORY];
|
2017-07-11 06:48:41 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* try to allocate from a different node but reuse this node if there
|
|
|
|
* are no other online nodes to be used (e.g. we are offlining a part
|
|
|
|
* of the only existing node)
|
|
|
|
*/
|
|
|
|
node_clear(nid, nmask);
|
|
|
|
if (nodes_empty(nmask))
|
|
|
|
node_set(nid, nmask);
|
2016-07-29 06:48:53 +08:00
|
|
|
|
2017-07-11 06:48:47 +08:00
|
|
|
return new_page_nodemask(page, nid, &nmask);
|
2016-07-29 06:48:53 +08:00
|
|
|
}
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
static int
|
|
|
|
do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
unsigned long pfn;
|
|
|
|
struct page *page;
|
|
|
|
int not_managed = 0;
|
|
|
|
int ret = 0;
|
|
|
|
LIST_HEAD(source);
|
|
|
|
|
2018-12-28 16:38:29 +08:00
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
|
2007-10-16 16:26:12 +08:00
|
|
|
if (!pfn_valid(pfn))
|
|
|
|
continue;
|
|
|
|
page = pfn_to_page(pfn);
|
2013-09-12 05:22:09 +08:00
|
|
|
|
|
|
|
if (PageHuge(page)) {
|
|
|
|
struct page *head = compound_head(page);
|
|
|
|
pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
|
|
|
|
if (compound_order(head) > PFN_SECTION_SHIFT) {
|
|
|
|
ret = -EBUSY;
|
|
|
|
break;
|
|
|
|
}
|
2018-12-28 16:38:29 +08:00
|
|
|
isolate_huge_page(page, &source);
|
2013-09-12 05:22:09 +08:00
|
|
|
continue;
|
2018-04-11 07:30:07 +08:00
|
|
|
} else if (PageTransHuge(page))
|
2017-09-09 07:11:15 +08:00
|
|
|
pfn = page_to_pfn(compound_head(page))
|
|
|
|
+ hpage_nr_pages(page) - 1;
|
2013-09-12 05:22:09 +08:00
|
|
|
|
hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory
prevents memory offline to succeed on 4.4 base kernel. The underlying
reason was that the HWPoison page has an elevated reference count and the
migration keeps failing. There are two problems with that. First of all
it is dubious to migrate the poisoned page because we know that accessing
that memory is possible to fail. Secondly it doesn't make any sense to
migrate a potentially broken content and preserve the memory corruption
over to a new location.
Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simply testcase
===
int main(void)
{
int ret;
int i;
int fd;
char *array = malloc(4096);
char *array_locked = malloc(4096);
fd = open("/tmp/data", O_RDONLY);
read(fd, array, 4095);
for (i = 0; i < 4096; i++)
array_locked[i] = 'd';
ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
if (ret)
perror("mlock");
sleep (20);
ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
if (ret)
perror("madvise");
for (i = 0; i < 4096; i++)
array_locked[i] = 'd';
return 0;
}
===
+ offline this memory.
In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
list
kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel: [<ffffffff81215f08>] do_execve+0x28/0x30
kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80
And that latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.
With the upstream kernel the failure is slightly different. The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails and
do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.
Fix both cases by explicitly checking HWPoisoned pages before we even try
to get reference on the page, try to unmap it if it is still mapped. As
explained by Naoya:
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.
Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself about
that. Naoya has noted the following:
: Theoretically no such gurantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also impli=
: es
: that the target page can still have PageLRU flag.
:
: /*
: * Torn down by someone else?
: */
: if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) {
: action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
: res =3D -EBUSY;
: goto out;
: }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.
Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.com>
Debugged-by: Oscar Salvador <osalvador@suse.com>
Tested-by: Oscar Salvador <osalvador@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 16:38:01 +08:00
|
|
|
/*
|
|
|
|
* HWPoison pages have elevated reference counts so the migration would
|
|
|
|
* fail on them. It also doesn't make any sense to migrate them in the
|
|
|
|
* first place. Still try to unmap such a page in case it is still mapped
|
|
|
|
* (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
|
|
|
|
* the unmap as the catch all safety net).
|
|
|
|
*/
|
|
|
|
if (PageHWPoison(page)) {
|
|
|
|
if (WARN_ON(PageLRU(page)))
|
|
|
|
isolate_lru_page(page);
|
|
|
|
if (page_mapped(page))
|
|
|
|
try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2011-05-25 08:12:19 +08:00
|
|
|
if (!get_page_unless_zero(page))
|
2007-10-16 16:26:12 +08:00
|
|
|
continue;
|
|
|
|
/*
|
2017-02-25 06:57:39 +08:00
|
|
|
* We can skip free pages. And we can deal with pages on
|
|
|
|
* LRU and non-lru movable pages.
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2017-02-25 06:57:39 +08:00
|
|
|
if (PageLRU(page))
|
|
|
|
ret = isolate_lru_page(page);
|
|
|
|
else
|
|
|
|
ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
if (!ret) { /* Success */
|
2011-05-25 08:12:19 +08:00
|
|
|
put_page(page);
|
vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:09 +08:00
|
|
|
list_add_tail(&page->lru, &source);
|
2017-02-25 06:57:39 +08:00
|
|
|
if (!__PageMovable(page))
|
|
|
|
inc_node_page_state(page, NR_ISOLATED_ANON +
|
|
|
|
page_is_file_cache(page));
|
2009-12-15 09:58:11 +08:00
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
} else {
|
2018-12-28 16:33:53 +08:00
|
|
|
pr_warn("failed to isolate pfn %lx\n", pfn);
|
2017-02-25 06:57:39 +08:00
|
|
|
dump_page(page, "isolation failed");
|
2011-05-25 08:12:19 +08:00
|
|
|
put_page(page);
|
2011-03-31 09:57:33 +08:00
|
|
|
/* Because we don't have big zone->lock. we should
|
2010-10-27 05:22:10 +08:00
|
|
|
check this again here. */
|
|
|
|
if (page_count(page)) {
|
|
|
|
not_managed++;
|
2010-10-27 05:22:10 +08:00
|
|
|
ret = -EBUSY;
|
2010-10-27 05:22:10 +08:00
|
|
|
break;
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
}
|
2010-10-27 05:22:10 +08:00
|
|
|
if (!list_empty(&source)) {
|
|
|
|
if (not_managed) {
|
2013-09-12 05:22:09 +08:00
|
|
|
putback_movable_pages(&source);
|
2010-10-27 05:22:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2012-10-09 07:32:54 +08:00
|
|
|
|
2016-07-29 06:48:53 +08:00
|
|
|
/* Allocate a new page from the nearest neighbor node */
|
|
|
|
ret = migrate_pages(&source, new_node_page, NULL, 0,
|
2013-02-23 08:35:14 +08:00
|
|
|
MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
|
2018-12-28 16:33:53 +08:00
|
|
|
if (ret) {
|
|
|
|
list_for_each_entry(page, &source, lru) {
|
|
|
|
pr_warn("migrating pfn %lx failed ret:%d ",
|
|
|
|
page_to_pfn(page), ret);
|
|
|
|
dump_page(page, "migration failure");
|
|
|
|
}
|
2013-09-12 05:22:09 +08:00
|
|
|
putback_movable_pages(&source);
|
2018-12-28 16:33:53 +08:00
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* remove from free_area[] and mark all as Reserved.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
__offline_isolated_pages(start, start + nr_pages);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2009-09-23 07:45:46 +08:00
|
|
|
walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
|
2007-10-16 16:26:12 +08:00
|
|
|
offline_isolated_pages_cb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check all pages in range, recoreded as memory resource, are isolated.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
long offlined = *(long *)data;
|
2012-12-12 08:00:45 +08:00
|
|
|
ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
|
2007-10-16 16:26:12 +08:00
|
|
|
offlined = nr_pages;
|
|
|
|
if (!ret)
|
|
|
|
*(long *)data += offlined;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static long
|
|
|
|
check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
long offlined = 0;
|
|
|
|
int ret;
|
|
|
|
|
2009-09-23 07:45:46 +08:00
|
|
|
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
|
2007-10-16 16:26:12 +08:00
|
|
|
check_pages_isolated_cb);
|
|
|
|
if (ret < 0)
|
|
|
|
offlined = (long)ret;
|
|
|
|
return offlined;
|
|
|
|
}
|
|
|
|
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
static int __init cmdline_parse_movable_node(char *p)
|
|
|
|
{
|
2017-07-07 06:41:05 +08:00
|
|
|
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
|
2014-01-22 07:49:35 +08:00
|
|
|
movable_node_enabled = true;
|
2017-07-07 06:41:05 +08:00
|
|
|
#else
|
|
|
|
pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
|
|
|
|
#endif
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
early_param("movable_node", cmdline_parse_movable_node);
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/* check which state of node_states will be changed when offline memory */
|
|
|
|
static void node_states_check_changes_offline(unsigned long nr_pages,
|
|
|
|
struct zone *zone, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
unsigned long present_pages = 0;
|
2018-10-27 06:07:38 +08:00
|
|
|
enum zone_type zt;
|
2012-12-12 08:01:03 +08:00
|
|
|
|
2018-10-27 06:07:38 +08:00
|
|
|
arg->status_change_nid = -1;
|
|
|
|
arg->status_change_nid_normal = -1;
|
|
|
|
arg->status_change_nid_high = -1;
|
2012-12-12 08:01:03 +08:00
|
|
|
|
|
|
|
/*
|
2018-10-27 06:07:38 +08:00
|
|
|
* Check whether node_states[N_NORMAL_MEMORY] will be changed.
|
|
|
|
* If the memory to be offline is within the range
|
|
|
|
* [0..ZONE_NORMAL], and it is the last present memory there,
|
|
|
|
* the zones in that range will become empty after the offlining,
|
|
|
|
* thus we can determine that we need to clear the node from
|
|
|
|
* node_states[N_NORMAL_MEMORY].
|
2012-12-12 08:01:03 +08:00
|
|
|
*/
|
2018-10-27 06:07:38 +08:00
|
|
|
for (zt = 0; zt <= ZONE_NORMAL; zt++)
|
2012-12-12 08:01:03 +08:00
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
2018-10-27 06:07:38 +08:00
|
|
|
if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
|
2012-12-12 08:01:03 +08:00
|
|
|
arg->status_change_nid_normal = zone_to_nid(zone);
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
/*
|
2018-10-27 06:07:38 +08:00
|
|
|
* node_states[N_HIGH_MEMORY] contains nodes which
|
|
|
|
* have normal memory or high memory.
|
|
|
|
* Here we add the present_pages belonging to ZONE_HIGHMEM.
|
|
|
|
* If the zone is within the range of [0..ZONE_HIGHMEM), and
|
|
|
|
* we determine that the zones in that range become empty,
|
|
|
|
* we need to clear the node for N_HIGH_MEMORY.
|
2012-12-13 05:51:49 +08:00
|
|
|
*/
|
2018-10-27 06:07:38 +08:00
|
|
|
present_pages += pgdat->node_zones[ZONE_HIGHMEM].present_pages;
|
|
|
|
if (zone_idx(zone) <= ZONE_HIGHMEM && nr_pages >= present_pages)
|
2012-12-13 05:51:49 +08:00
|
|
|
arg->status_change_nid_high = zone_to_nid(zone);
|
|
|
|
#endif
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/*
|
2018-10-27 06:07:38 +08:00
|
|
|
* We have accounted the pages from [0..ZONE_NORMAL), and
|
|
|
|
* in case of CONFIG_HIGHMEM the pages from ZONE_HIGHMEM
|
|
|
|
* as well.
|
|
|
|
* Here we count the possible pages from ZONE_MOVABLE.
|
|
|
|
* If after having accounted all the pages, we see that the nr_pages
|
|
|
|
* to be offlined is over or equal to the accounted pages,
|
|
|
|
* we know that the node will become empty, and so, we can clear
|
|
|
|
* it for N_MEMORY as well.
|
2012-12-12 08:01:03 +08:00
|
|
|
*/
|
2018-10-27 06:07:38 +08:00
|
|
|
present_pages += pgdat->node_zones[ZONE_MOVABLE].present_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
|
|
|
|
if (nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid = zone_to_nid(zone);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void node_states_clear_node(int node, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
if (arg->status_change_nid_normal >= 0)
|
|
|
|
node_clear_state(node, N_NORMAL_MEMORY);
|
|
|
|
|
2018-10-27 06:07:28 +08:00
|
|
|
if (arg->status_change_nid_high >= 0)
|
2012-12-12 08:01:03 +08:00
|
|
|
node_clear_state(node, N_HIGH_MEMORY);
|
2012-12-13 05:51:49 +08:00
|
|
|
|
2018-10-27 06:07:28 +08:00
|
|
|
if (arg->status_change_nid >= 0)
|
2012-12-13 05:51:49 +08:00
|
|
|
node_clear_state(node, N_MEMORY);
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
2012-10-09 07:33:58 +08:00
|
|
|
static int __ref __offline_pages(unsigned long start_pfn,
|
2017-11-16 09:33:38 +08:00
|
|
|
unsigned long end_pfn)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
2017-11-16 09:33:38 +08:00
|
|
|
unsigned long pfn, nr_pages;
|
2007-10-16 16:26:12 +08:00
|
|
|
long offlined_pages;
|
2017-11-16 09:33:34 +08:00
|
|
|
int ret, node;
|
2013-07-04 06:02:11 +08:00
|
|
|
unsigned long flags;
|
2017-02-04 05:13:23 +08:00
|
|
|
unsigned long valid_start, valid_end;
|
2007-10-16 16:26:12 +08:00
|
|
|
struct zone *zone;
|
2007-10-22 07:41:36 +08:00
|
|
|
struct memory_notify arg;
|
2018-12-28 16:33:49 +08:00
|
|
|
char *reason;
|
2007-10-16 16:26:12 +08:00
|
|
|
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
mem_hotplug_begin();
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/* This makes hotplug much easier...and readable.
|
|
|
|
we assume this for now. .*/
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
|
|
|
|
&valid_end)) {
|
|
|
|
mem_hotplug_done();
|
2018-12-28 16:33:49 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
reason = "multizone range";
|
|
|
|
goto failed_removal;
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
}
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2017-02-04 05:13:23 +08:00
|
|
|
zone = page_zone(pfn_to_page(valid_start));
|
2007-10-22 07:41:36 +08:00
|
|
|
node = zone_to_nid(zone);
|
|
|
|
nr_pages = end_pfn - start_pfn;
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/* set above range as isolated */
|
2012-12-12 08:00:45 +08:00
|
|
|
ret = start_isolate_page_range(start_pfn, end_pfn,
|
2018-12-28 16:33:56 +08:00
|
|
|
MIGRATE_MOVABLE,
|
|
|
|
SKIP_HWPOISON | REPORT_FAILURE);
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
if (ret) {
|
|
|
|
mem_hotplug_done();
|
2018-12-28 16:33:49 +08:00
|
|
|
reason = "failure to isolate range";
|
|
|
|
goto failed_removal;
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
}
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
arg.start_pfn = start_pfn;
|
|
|
|
arg.nr_pages = nr_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_check_changes_offline(nr_pages, zone, &arg);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
ret = memory_notify(MEM_GOING_OFFLINE, &arg);
|
|
|
|
ret = notifier_to_errno(ret);
|
2018-12-28 16:33:49 +08:00
|
|
|
if (ret) {
|
|
|
|
reason = "notifier failure";
|
|
|
|
goto failed_removal_isolated;
|
|
|
|
}
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2018-12-28 16:38:32 +08:00
|
|
|
do {
|
|
|
|
for (pfn = start_pfn; pfn;) {
|
|
|
|
if (signal_pending(current)) {
|
|
|
|
ret = -EINTR;
|
|
|
|
reason = "signal backoff";
|
|
|
|
goto failed_removal_isolated;
|
|
|
|
}
|
2017-11-16 09:33:34 +08:00
|
|
|
|
2018-12-28 16:38:32 +08:00
|
|
|
cond_resched();
|
|
|
|
lru_add_drain_all();
|
|
|
|
drain_all_pages(zone);
|
|
|
|
|
|
|
|
pfn = scan_movable_pages(pfn, end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
/*
|
|
|
|
* TODO: fatal migration failures should bail
|
|
|
|
* out
|
|
|
|
*/
|
|
|
|
do_migrate_range(pfn, end_pfn);
|
|
|
|
}
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
|
2018-12-28 16:38:32 +08:00
|
|
|
/*
|
|
|
|
* Dissolve free hugepages in the memory block before doing
|
|
|
|
* offlining actually in order to make hugetlbfs's object
|
|
|
|
* counting consistent.
|
|
|
|
*/
|
|
|
|
ret = dissolve_free_huge_pages(start_pfn, end_pfn);
|
|
|
|
if (ret) {
|
|
|
|
reason = "failure to dissolve huge pages";
|
|
|
|
goto failed_removal_isolated;
|
|
|
|
}
|
|
|
|
/* check again */
|
|
|
|
offlined_pages = check_pages_isolated(start_pfn, end_pfn);
|
|
|
|
} while (offlined_pages < 0);
|
2017-11-16 09:33:34 +08:00
|
|
|
|
2016-03-18 05:19:35 +08:00
|
|
|
pr_info("Offlined Pages %ld\n", offlined_pages);
|
2012-09-20 09:48:02 +08:00
|
|
|
/* Ok, all of our target is isolated.
|
2007-10-16 16:26:12 +08:00
|
|
|
We cannot do rollback at this point. */
|
|
|
|
offline_isolated_pages(start_pfn, end_pfn);
|
2007-11-15 08:59:12 +08:00
|
|
|
/* reset pagetype flags and makes migrate type to be MOVABLE */
|
2012-04-03 21:06:15 +08:00
|
|
|
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* removal success */
|
2013-07-04 06:03:21 +08:00
|
|
|
adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
|
2007-10-16 16:26:12 +08:00
|
|
|
zone->present_pages -= offlined_pages;
|
2013-07-04 06:02:11 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
2007-10-16 16:26:12 +08:00
|
|
|
zone->zone_pgdat->node_present_pages -= offlined_pages;
|
2013-07-04 06:02:11 +08:00
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2011-05-25 08:11:32 +08:00
|
|
|
init_per_zone_wmark_min();
|
|
|
|
|
2012-10-09 07:31:51 +08:00
|
|
|
if (!populated_zone(zone)) {
|
2012-08-01 07:43:32 +08:00
|
|
|
zone_pcp_reset(zone);
|
2017-09-07 07:20:24 +08:00
|
|
|
build_all_zonelists(NULL);
|
2012-10-09 07:31:51 +08:00
|
|
|
} else
|
|
|
|
zone_pcp_update(zone);
|
2012-08-01 07:43:32 +08:00
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_clear_node(node, &arg);
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
if (arg.status_change_nid >= 0) {
|
2009-12-15 09:58:33 +08:00
|
|
|
kswapd_stop(node);
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
kcompactd_stop(node);
|
|
|
|
}
|
2009-06-17 06:32:50 +08:00
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
vm_total_pages = nr_free_pagecache_pages();
|
|
|
|
writeback_set_ratelimit();
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
memory_notify(MEM_OFFLINE, &arg);
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
mem_hotplug_done();
|
2007-10-16 16:26:12 +08:00
|
|
|
return 0;
|
|
|
|
|
2018-12-28 16:33:49 +08:00
|
|
|
failed_removal_isolated:
|
|
|
|
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
failed_removal:
|
2018-12-28 16:33:49 +08:00
|
|
|
pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
|
2016-03-18 05:19:35 +08:00
|
|
|
(unsigned long long) start_pfn << PAGE_SHIFT,
|
2018-12-28 16:33:49 +08:00
|
|
|
((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
|
|
|
|
reason);
|
2007-10-22 07:41:36 +08:00
|
|
|
memory_notify(MEM_CANCEL_OFFLINE, &arg);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* pushback to free area */
|
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3be ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
a) device_lock()
b) mem_hotplug_lock
While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.
In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.
The problems I spotted related to this:
1. Memory block device attributes: While .state first calls
mem_hotplug_begin() and the calls device_online() - which takes
device_lock() - .online does no longer call mem_hotplug_begin(), so
effectively calls online_pages() without mem_hotplug_lock.
2. device_online() should be called under device_hotplug_lock, however
onlining memory during add_memory() does not take care of that.
In addition, I think there is also something wrong about the locking in
3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
without locks. This was introduced after 30467e0b3be. And skimming over
the code, I assume it could need some more care in regards to locking
(e.g. device_online() called without device_hotplug_lock. This will
be addressed in the following patches.
Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()
We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().
Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).
[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
2015-February/065324.html
This patch is partly based on a patch by Vitaly Kuznetsov.
Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:29 +08:00
|
|
|
mem_hotplug_done();
|
2007-10-16 16:26:12 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-10-19 11:25:58 +08:00
|
|
|
|
2012-10-09 07:33:58 +08:00
|
|
|
int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
|
|
|
|
{
|
2017-11-16 09:33:38 +08:00
|
|
|
return __offline_pages(start_pfn, start_pfn + nr_pages);
|
2012-10-09 07:33:58 +08:00
|
|
|
}
|
2013-05-08 06:29:49 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|
2012-10-09 07:33:58 +08:00
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
/**
|
|
|
|
* walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
|
|
|
|
* @start_pfn: start pfn of the memory range
|
2013-04-30 06:06:16 +08:00
|
|
|
* @end_pfn: end pfn of the memory range
|
2013-02-23 08:32:54 +08:00
|
|
|
* @arg: argument passed to func
|
|
|
|
* @func: callback for each memory section walked
|
|
|
|
*
|
|
|
|
* This function walks through all present mem sections in range
|
|
|
|
* [start_pfn, end_pfn) and call func on each mem section.
|
|
|
|
*
|
|
|
|
* Returns the return value of func.
|
|
|
|
*/
|
2013-05-08 06:29:49 +08:00
|
|
|
int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
|
2013-02-23 08:32:54 +08:00
|
|
|
void *arg, int (*func)(struct memory_block *, void *))
|
2008-10-19 11:25:58 +08:00
|
|
|
{
|
2012-10-09 07:34:01 +08:00
|
|
|
struct memory_block *mem = NULL;
|
|
|
|
struct mem_section *section;
|
|
|
|
unsigned long pfn, section_nr;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
section_nr = pfn_to_section_nr(pfn);
|
|
|
|
if (!present_section_nr(section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
section = __nr_to_section(section_nr);
|
|
|
|
/* same memblock? */
|
|
|
|
if (mem)
|
|
|
|
if ((section_nr >= mem->start_section_nr) &&
|
|
|
|
(section_nr <= mem->end_section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
mem = find_memory_block_hinted(section, mem);
|
|
|
|
if (!mem)
|
|
|
|
continue;
|
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
ret = func(mem, arg);
|
2012-10-09 07:34:01 +08:00
|
|
|
if (ret) {
|
2013-02-23 08:32:54 +08:00
|
|
|
kobject_put(&mem->dev.kobj);
|
|
|
|
return ret;
|
2012-10-09 07:34:01 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mem)
|
|
|
|
kobject_put(&mem->dev.kobj);
|
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-05-08 06:29:49 +08:00
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2013-11-13 07:07:20 +08:00
|
|
|
static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
|
2013-02-23 08:32:54 +08:00
|
|
|
{
|
|
|
|
int ret = !is_memblock_offlined(mem);
|
|
|
|
|
2013-04-30 06:08:49 +08:00
|
|
|
if (unlikely(ret)) {
|
|
|
|
phys_addr_t beginpa, endpa;
|
|
|
|
|
|
|
|
beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
|
|
|
|
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
|
2016-03-18 05:19:47 +08:00
|
|
|
pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
|
2013-04-30 06:08:49 +08:00
|
|
|
&beginpa, &endpa);
|
|
|
|
}
|
2013-02-23 08:32:54 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static int check_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:14 +08:00
|
|
|
{
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_present_cpu(cpu) {
|
|
|
|
if (cpu_to_node(cpu) == pgdat->node_id)
|
|
|
|
/*
|
|
|
|
* the cpu on this node isn't removed, and we can't
|
|
|
|
* offline this node.
|
|
|
|
*/
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
/**
|
|
|
|
* try_offline_node
|
2018-04-06 07:24:57 +08:00
|
|
|
* @nid: the node ID
|
2013-09-12 05:21:50 +08:00
|
|
|
*
|
|
|
|
* Offline a node if all memory sections and cpus of the node are removed.
|
|
|
|
*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations before this call.
|
|
|
|
*/
|
2013-02-23 08:33:27 +08:00
|
|
|
void try_offline_node(int nid)
|
2013-02-23 08:33:14 +08:00
|
|
|
{
|
2013-02-23 08:33:16 +08:00
|
|
|
pg_data_t *pgdat = NODE_DATA(nid);
|
|
|
|
unsigned long start_pfn = pgdat->node_start_pfn;
|
|
|
|
unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
|
2013-02-23 08:33:14 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
unsigned long section_nr = pfn_to_section_nr(pfn);
|
|
|
|
|
|
|
|
if (!present_section_nr(section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (pfn_to_nid(pfn) != nid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* some memory sections of this node are not removed, and we
|
|
|
|
* can't offline node now.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2018-12-28 16:34:13 +08:00
|
|
|
if (check_cpu_on_node(pgdat))
|
2013-02-23 08:33:14 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* all memory/cpu of this node are removed, we can offline this
|
|
|
|
* node now.
|
|
|
|
*/
|
|
|
|
node_set_offline(nid);
|
|
|
|
unregister_one_node(nid);
|
|
|
|
}
|
2013-02-23 08:33:27 +08:00
|
|
|
EXPORT_SYMBOL(try_offline_node);
|
2013-02-23 08:33:14 +08:00
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
/**
|
|
|
|
* remove_memory
|
2018-04-06 07:24:57 +08:00
|
|
|
* @nid: the node ID
|
|
|
|
* @start: physical address of the region to remove
|
|
|
|
* @size: size of the region to remove
|
2013-09-12 05:21:50 +08:00
|
|
|
*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations before this call, as required by
|
|
|
|
* try_offline_node().
|
|
|
|
*/
|
mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock. And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.
While e.g.
echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
echo 1 > /sys/devices/system/memory/memory9/online
Will not take the mem_hotplug_lock. However the device_lock() and
device_hotplug_lock.
E.g. via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock. So we can
have concurrent callers in online_pages(). We e.g. touch in
online_pages() basically unprotected zone->present_pages then.
Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible. We
would e.g. have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.
Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.
I propose the general rules (documentation added in patch 6):
1. add_memory/add_memory_resource() must only be called with
device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
device_hotplug_lock. This is already documented and true for now in core
code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
online_pages/offline_pages.
To me, this looks way cleaner than what we have right now (and easier to
verify). And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.
This patch (of 6):
remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported. So let's provide a variant
that takes the lock and only export that one.
The lock is already held in
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
arch/powerpc/platforms/powernv/memtrace.c
Apart from that, there are not other users in the tree.
Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:18 +08:00
|
|
|
void __ref __remove_memory(int nid, u64 start, u64 size)
|
2013-02-23 08:32:54 +08:00
|
|
|
{
|
2013-05-27 18:58:46 +08:00
|
|
|
int ret;
|
memory-hotplug: try to offline the memory twice to avoid dependence
memory can't be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1. The address range is [1G, 1.5G).
You will find 4 new directories memory8, memory9, memory10, and memory11
under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store page
cgroup when we online pages. When we online memory8, the memory stored
page cgroup is not provided by this memory device. But when we online
memory9, the memory stored page cgroup may be provided by memory8. So
we can't offline memory8 now. We should offline the memory in the
reversed order.
When the memory device is hotremoved, we will auto offline memory
provided by this memory device. But we don't know which memory is
onlined first, so offlining memory may fail. In such case, iterate
twice to offline the memory. 1st iterate: offline every non primary
memory block. 2nd iterate: offline primary (i.e. first added) memory
block.
This idea is suggested by KOSAKI Motohiro.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 08:32:50 +08:00
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
BUG_ON(check_hotplug_memory_range(start, size));
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2013-02-23 08:32:52 +08:00
|
|
|
|
|
|
|
/*
|
2013-05-27 18:58:46 +08:00
|
|
|
* All memory blocks must be offlined before removing memory. Check
|
|
|
|
* whether all memory blocks in question are offline and trigger a BUG()
|
|
|
|
* if this is not the case.
|
2013-02-23 08:32:52 +08:00
|
|
|
*/
|
2013-05-27 18:58:46 +08:00
|
|
|
ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
|
2013-11-13 07:07:20 +08:00
|
|
|
check_memblock_offlined_cb);
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
if (ret)
|
2013-05-27 18:58:46 +08:00
|
|
|
BUG();
|
2013-02-23 08:32:52 +08:00
|
|
|
|
2013-02-23 08:32:56 +08:00
|
|
|
/* remove memmap entry */
|
|
|
|
firmware_map_remove(start, start + size, "System RAM");
|
2015-08-15 06:35:16 +08:00
|
|
|
memblock_free(start, size);
|
|
|
|
memblock_remove(start, size);
|
2013-02-23 08:32:56 +08:00
|
|
|
|
2018-12-28 16:36:22 +08:00
|
|
|
arch_remove_memory(nid, start, size, NULL);
|
2013-02-23 08:32:58 +08:00
|
|
|
|
2013-02-23 08:33:14 +08:00
|
|
|
try_offline_node(nid);
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2008-10-19 11:25:58 +08:00
|
|
|
}
|
mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock. And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.
While e.g.
echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
echo 1 > /sys/devices/system/memory/memory9/online
Will not take the mem_hotplug_lock. However the device_lock() and
device_hotplug_lock.
E.g. via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock. So we can
have concurrent callers in online_pages(). We e.g. touch in
online_pages() basically unprotected zone->present_pages then.
Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible. We
would e.g. have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.
Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.
I propose the general rules (documentation added in patch 6):
1. add_memory/add_memory_resource() must only be called with
device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
device_hotplug_lock. This is already documented and true for now in core
code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
online_pages/offline_pages.
To me, this looks way cleaner than what we have right now (and easier to
verify). And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.
This patch (of 6):
remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported. So let's provide a variant
that takes the lock and only export that one.
The lock is already held in
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
arch/powerpc/platforms/powernv/memtrace.c
Apart from that, there are not other users in the tree.
Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:10:18 +08:00
|
|
|
|
|
|
|
void remove_memory(int nid, u64 start, u64 size)
|
|
|
|
{
|
|
|
|
lock_device_hotplug();
|
|
|
|
__remove_memory(nid, start, size);
|
|
|
|
unlock_device_hotplug();
|
|
|
|
}
|
2008-10-19 11:25:58 +08:00
|
|
|
EXPORT_SYMBOL_GPL(remove_memory);
|
2013-06-02 04:24:07 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|