We need a node which only contains movable memory. This feature is very
important for node hotplug. If a node has normal/highmem, the memory may
be used by the kernel and can't be offlined. If the node only contains
movable memory, we can offline the memory and the node.
All are prepared, we can actually introduce N_MEMORY.
add CONFIG_MOVABLE_NODE make we can use it for movable-dedicated node
[akpm@linux-foundation.org: fix Kconfig text]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While profiling numa/core v16 with cgroup_disable=memory on the command
line, I noticed mem_cgroup_count_vm_event() still showed up as high as
0.60% in perftop.
This occurs because the function is called extremely often even when memcg
is disabled.
To fix this, inline the check for mem_cgroup_disabled() so we avoid the
unnecessary function call if memcg is disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During reviewing the source code, I found a comment which mention that
after f_op->mmap(), vma's start address can be changed. I didn't verify
that it is really possible, because there are so many f_op->mmap()
implementation. But if there are some mmap() which change vma's start
address, it is possible error situation, because we already prepare prev
vma, rb_link and rb_parent and these are related to original address.
So add WARN_ON_ONCE for finding that this situtation really happens.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Since we introduced N_MEMORY, we update the initialization of node_states.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have N_NORMAL_MEMORY for standing for the nodes that have normal memory
with zone_type <= ZONE_NORMAL.
And we have N_HIGH_MEMORY for standing for the nodes that have normal or
high memory.
But we don't have any word to stand for the nodes that have *any* memory.
And we have N_CPU but without N_MEMORY.
Current code reuse the N_HIGH_MEMORY for this purpose because any node
which has memory must have high memory or normal memory currently.
A) But this reusing is bad for *readability*. Because the name
N_HIGH_MEMORY just stands for high or normal:
A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)
The user will be confused(why this function just counts for high or
normal memory node? does it counts for ZONE_MOVABLE's lru pages?)
until someone else tell them N_HIGH_MEMORY is reused to stand for
nodes that have any memory.
A.cont) If we introduce N_MEMORY, we can reduce this confusing
AND make the code more clearly:
A.example 2) mm/page_cgroup.c use N_HIGH_MEMORY twice:
One is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {
It means if the node have memory, we will allocate page_cgroup map for
the node. We should use N_MEMORY instead here to gaim more clearly.
The second using is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);
It means if the node has high or normal memory that can be allocated
from kernel. We should keep N_HIGH_MEMORY here, and it will be better
if the "any memory" semantic of N_HIGH_MEMORY is removed.
B) This reusing is out-dated if we introduce MOVABLE-dedicated node.
The MOVABLE-dedicated node should not appear in
node_stats[N_HIGH_MEMORY] nor node_stats[N_NORMAL_MEMORY],
because MOVABLE-dedicated node has no high or normal memory.
In x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, if a MOVABLE-dedicated node
is in node_stats[N_HIGH_MEMORY], it is also means it is in
node_stats[N_NORMAL_MEMORY], it causes SLUB wrong.
The slub uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and creates kmem_cache_node for MOVABLE-dedicated node and cause problem.
In one word, we need a N_MEMORY. We just intrude it as an alias to
N_HIGH_MEMORY and fix all im-proper usages of N_HIGH_MEMORY in late
patches.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_contig_migrate_range() should use all possible ways to get all the
pages migrated from the given memory range, so pruning per-cpu lru lists
for all CPUs is required, regadless the cost of such operation. Otherwise
some pages which got stuck at per-cpu lru list might get missed by
migration procedure causing the contiguous allocation to fail.
Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
compact_capture_page() is only used if compaction is enabled so it should
be moved into the corresponding #ifdef.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pmd value is stable only with mm->page_table_lock taken. After taking
the lock we need to check that nobody modified the pmd before changing it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By default kernel tries to use huge zero page on read page fault. It's
possible to disable huge zero page by writing 0 or enable it back by
writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hzp_alloc is incremented every time a huge zero page is successfully
allocated. It includes allocations which where dropped due
race with other allocation. Note, it doesn't count every map
of the huge zero page, only its allocation.
hzp_alloc_failed is incremented if kernel fails to allocate huge zero
page and falls back to using small pages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.
We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.
If counter is 0, get_huge_zero_page() allocates a new huge page and takes
two references: one for caller and one for shrinker. We free the page
only in shrinker callback if counter is 1 (only shrinker has the
reference).
put_huge_zero_page() only decrements counter. Counter is never zero in
put_huge_zero_page() since shrinker holds on reference.
Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.
Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting comparing to lazy huge page
allocation. I think it's pretty reasonable for synthetic benchmark.
[lliubbo@gmail.com: fix mismerge]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of allocating huge zero page on hugepage_init() we can postpone it
until first huge zero page map. It saves memory if THP is not in use.
cmpxchg() is used to avoid race on huge_zero_pfn initialization.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All code paths seems covered. Now we can map huge zero page on read page
fault.
We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.
If we fail to setup huge zero page (ENOMEM) we fallback to
handle_pte_fault() as we normally do in THP.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can't split huge zero page itself (and it's bug if we try), but we
can split the pmd which points to it.
On splitting the pmd we create a table with all ptes set to normal zero
page.
[akpm@linux-foundation.org: fix build error]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass vma instead of mm and add address parameter.
In most cases we already have vma on the stack. We provides
split_huge_page_pmd_mm() for few cases when we have mm, but not vma.
This change is preparation to huge zero pmd splitting implementation.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mprotect core never tries to make page writable using change_huge_pmd().
Let's add an assert that the assumption is true. It's important to be
sure we will not make huge zero page writable.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On write access to huge zero page we alloc a new huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set pte around
fault address to newly allocated normal (4k) page. All other ptes in the
pmd set to normal zero page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's easy to copy huge zero page. Just set destination pmd to huge zero
page.
It's safe to copy huge zero page since we have none yet :-p
[rientjes@google.com: fix comment]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't have a mapped page to zap in huge zero page case. Let's just clear
pmd and remove it from tlb.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.
The main reason for that big difference is lacking zero page in THP case.
We have to allocate a real page on read page fault.
A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#define MB 1024*1024
int main(int argc, char **argv)
{
char *p;
int i;
posix_memalign((void **)&p, 2 * MB, 200 * MB);
for (i = 0; i < 200 * MB; i+= 4096)
assert(p[i] == 0);
pause();
return 0;
}
With thp-never RSS is about 400k, but with thp-always it's 200M. After
the patcheset thp-always RSS is 400k too.
Design overview.
Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros. The way how we allocate it changes in the patchset:
- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;
We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault. If we fail to setup
hzp (ENOMEM) we fallback to handle_pte_fault() as we normally do in THP.
On wp fault to hzp we allocate real memory for the huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set pte
around fault address to newly allocated normal (4k) page. All other ptes
in the pmd set to normal zero page.
We cannot split hzp (and it's bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to normal zero page.
===
By hpa's request I've tried alternative approach for hzp implementation
(see Virtual huge zero page patchset): pmd table with all entries set to
zero page. This way should be more cache friendly, but it increases TLB
pressure.
The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.
Some numbers to compare two implementations (on 4s Westmere-EX):
Mirobenchmark1
==============
test:
posix_memalign((void **)&p, 2 * MB, 8 * GB);
for (i = 0; i < 100; i++) {
assert(memcmp(p, p + 4*GB, 4*GB) == 0);
asm volatile ("": : :"memory");
}
hzp:
Performance counter stats for './test_memcmp' (5 runs):
32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
40 context-switches # 0.001 K/sec ( +- 0.94% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.130 K/sec ( +- 0.00% )
76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
134,355,715,816 instructions # 1.75 insns per cycle
# 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
32.413866442 seconds time elapsed ( +- 0.13% )
vhzp:
Performance counter stats for './test_memcmp' (5 runs):
30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
38 context-switches # 0.001 K/sec ( +- 1.53% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.139 K/sec ( +- 0.01% )
71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
134,982,215,437 instructions # 1.88 insns per cycle
# 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
30.381324695 seconds time elapsed ( +- 0.13% )
Mirobenchmark2
==============
test:
posix_memalign((void **)&p, 2 * MB, 8 * GB);
for (i = 0; i < 1000; i++) {
char *_p = p;
while (_p < p+4*GB) {
assert(*_p == *(_p+4*GB));
_p += 4096;
asm volatile ("": : :"memory");
}
}
hzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
9 context-switches # 0.003 K/sec ( +- 4.97% )
4,384 page-faults # 0.001 M/sec ( +- 0.00% )
8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
9,494,670,537 instructions # 1.14 insns per cycle
# 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
3,168,102,115 L1-dcache-loads
# 903.693 M/sec ( +- 0.11% ) [41.70%]
1,048,710,998 L1-dcache-misses
# 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
1,047,699,685 LLC-load
# 298.854 M/sec ( +- 0.03% ) [33.38%]
2,287 LLC-misses
# 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
3,166,187,367 dTLB-loads
# 903.147 M/sec ( +- 0.02% ) [33.35%]
4,266,538 dTLB-misses
# 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
3.513339813 seconds time elapsed ( +- 0.26% )
vhzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
62 context-switches # 0.002 K/sec ( +- 0.61% )
4,384 page-faults # 0.160 K/sec ( +- 0.01% )
64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
10,033,724,846 instructions # 0.15 insns per cycle
# 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
3,302,006,540 L1-dcache-loads
# 120.891 M/sec ( +- 0.11% ) [41.68%]
271,374,358 L1-dcache-misses
# 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
20,385,476 LLC-load
# 0.746 M/sec ( +- 1.64% ) [33.34%]
76,754 LLC-misses
# 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
3,309,927,290 dTLB-loads
# 121.181 M/sec ( +- 0.03% ) [33.34%]
2,098,967,427 dTLB-misses
# 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
27.364448741 seconds time elapsed ( +- 0.24% )
===
I personally prefer implementation present in this patchset. It doesn't
touch arch-specific code.
This patch:
Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.
For now let's allocate the page on hugepage_init(). We'll switch to lazy
allocation later.
We are not going to map the huge zero page until we can handle it properly
on all code paths.
is_huge_zero_{pfn,pmd}() functions will be used by following patches to
check whether the pfn/pmd is huge zero page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name of this function is not suitable, and removing the function and
open-coding it into each call sites makes the code more understandable.
Additionally, we shouldn't do an allocation from bootmem when
slab_is_available(), so directly return kmalloc()'s return value.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no implementation of bootmem_arch_preferred_node() and a call to
this function will cause a compilation error. So remove it.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull big execve/kernel_thread/fork unification series from Al Viro:
"All architectures are converted to new model. Quite a bit of that
stuff is actually shared with architecture trees; in such cases it's
literally shared branch pulled by both, not a cherry-pick.
A lot of ugliness and black magic is gone (-3KLoC total in this one):
- kernel_thread()/kernel_execve()/sys_execve() redesign.
We don't do syscalls from kernel anymore for either kernel_thread()
or kernel_execve():
kernel_thread() is essentially clone(2) with callback run before we
return to userland, the callbacks either never return or do
successful do_execve() before returning.
kernel_execve() is a wrapper for do_execve() - it doesn't need to
do transition to user mode anymore.
As a result kernel_thread() and kernel_execve() are
arch-independent now - they live in kernel/fork.c and fs/exec.c
resp. sys_execve() is also in fs/exec.c and it's completely
architecture-independent.
- daemonize() is gone, along with its parts in fs/*.c
- struct pt_regs * is no longer passed to do_fork/copy_process/
copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.
- sys_fork()/sys_vfork()/sys_clone() unified; some architectures
still need wrappers (ones with callee-saved registers not saved in
pt_regs on syscall entry), but the main part of those suckers is in
kernel/fork.c now."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
do_coredump(): get rid of pt_regs argument
print_fatal_signal(): get rid of pt_regs argument
ptrace_signal(): get rid of unused arguments
get rid of ptrace_signal_deliver() arguments
new helper: signal_pt_regs()
unify default ptrace_signal_deliver
flagday: kill pt_regs argument of do_fork()
death to idle_regs()
don't pass regs to copy_process()
flagday: don't pass regs to copy_thread()
bfin: switch to generic vfork, get rid of pointless wrappers
xtensa: switch to generic clone()
openrisc: switch to use of generic fork and clone
unicore32: switch to generic clone(2)
score: switch to generic fork/vfork/clone
c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
mn10300: switch to generic fork/vfork/clone
h8300: switch to generic fork/vfork/clone
tile: switch to generic clone()
...
Conflicts:
arch/microblaze/include/asm/Kbuild
This branch contains a set of various board updates for ARM platforms.
A few shmobile platforms that are stale have been removed, some
defconfig updates for various boards selecting new features such as
pinctrl subsystem support, and various updates enabling peripherals, etc.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQyLD0AAoJEIwa5zzehBx3nNYQAI8a1cS/0U1GGXhmYJs4ZJjM
9yeFKOQmv7kQMmqC9b9UhoCtqwsxR4s2qjUps3464q8pZmEXysc+Z5kH/av9ULJw
zp62iTxa/qBfnF+UKCt4PAxoL9KvN36XgPH6OADiQ9rvsJ20zF8Agq22CiSF+8AR
CfX60joQnd/BEBORP86Zm5AlAeXiah/MrUBQuiCgNq+Ew9bQ0/I15UKZzlLKeQST
eYviHjlGPFcXWaxuIgNUQ5KzFfKtSERHHHsiIy9DEi0zq7fLhiwY+eIYUWN34aDp
K+bV7wmh6ufa3/Z024Du5FkSI4fIOLz6b5hlJxOzgvQ3AZJ7JK8IYSXCibgUjWxO
gFxpVj/MYGf+FlWVQpaf/niMuhYqRAbWy1rsGcmyn8+rb8cHxDHs/3Z4dQTmgxMf
6lEdR4npULAP22MSB71yQtPjn9XW1338a0/a5GPdvihZjwpMH2ZXRxAkGuGpamFT
/oXwADuh5a+1uMctXQqb/1FnJcmH/mbEp40bU90AwSqJ6Y8FJts4P5B3H9RQ5rO7
uz/KVB4ilUHGONtrTiNohYbrNLj3t2kNUt8QrQINMWBHSDFHLUbK48QcoqK4uJPB
9RFu9mtovUvfJXVivOoO/BRVnKomUba7fhET1VP4UXiCc7WNtmj36kiXSHbfu3NV
/ox1FAH3UU8jB91OeEqV
=sgYz
-----END PGP SIGNATURE-----
Merge tag 'boards' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC board updates from Olof Johansson:
"This branch contains a set of various board updates for ARM platforms.
A few shmobile platforms that are stale have been removed, some
defconfig updates for various boards selecting new features such as
pinctrl subsystem support, and various updates enabling peripherals,
etc."
Fix up conflicts mostly as per Olof.
* tag 'boards' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (58 commits)
ARM: S3C64XX: Add dummy supplies for Glenfarclas LDOs
ARM: S3C64XX: Add registration of WM2200 Bells device on Cragganmore
ARM: kirkwood: Add Plat'Home OpenBlocks A6 support
ARM: Dove: update defconfig
ARM: Kirkwood: update defconfig for new boards
arm: orion5x: add DT related options in defconfig
arm: orion5x: convert 'LaCie Ethernet Disk mini v2' to Device Tree
arm: orion5x: basic Device Tree support
arm: orion5x: mechanical defconfig update
ARM: kirkwood: Add support for the MPL CEC4
arm: kirkwood: add support for ZyXEL NSA310
ARM: Kirkwood: new board USI Topkick
ARM: kirkwood: use gpio-fan DT binding on lsxl
ARM: Kirkwood: add Netspace boards to defconfig
ARM: kirkwood: DT board setup for Network Space Mini v2
ARM: kirkwood: DT board setup for Network Space Lite v2
ARM: kirkwood: DT board setup for Network Space v2 and parents
leds: leds-ns2: add device tree binding
ARM: Kirkwood: Enable the second I2C bus
ARM: mmp: select pinctrl driver
...
This contains the bulk of new SoC development for this merge window.
Two new platforms have been added, the sunxi platforms (Allwinner A1x
SoCs) by Maxime Ripard, and a generic Broadcom platform for a new
series of ARMv7 platforms from them, where the hope is that we can
keep the platform code generic enough to have them all share one mach
directory. The new Broadcom platform is contributed by Christian Daudt.
Highbank has grown support for Calxeda's next generation of hardware,
ECX-2000.
clps711x has seen a lot of cleanup from Alexander Shiyan, and he's also
taken on maintainership of the platform.
Beyond this there has been a bunch of work from a number of people on
converting more platforms to IRQ domains, pinctrl conversion, cleanup
and general feature enablement across most of the active platforms.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQyLCjAAoJEIwa5zzehBx3AdQP/R+L3+EQMjiEWt/p7g/ql5Em
0SnP92CcGzrjgLTg9z1FeOazfOsGnkZAYUlDRkqfKobH3VqkhYFFtt1/0x0KMahm
xcowHgMBOyimFdWT9vLK3J8U6DLui5XrEG9LGH2VL+lqmfjIyP/OOF3mVc0/+pV9
WTLAsYswdBRSeiNuF43kqlfrOwF6xsPLgiNMlc82w6BzHqoHu6dOif5M9MqWaApS
V74DPmwLD371Tyit6aHqt3JOqpgiPSHlmxkzomK+5idcW3Pa7HnzzFYmx85dk/eN
J2siqIkoOu7tEfjIbNZTL2MYoX4tUUKv4qZZ3IOl3YSWaV3P5ilMApF01XVrkk8E
DWOMhzte9hC7L90W+/kCPLF1VyeAhCem2KQWUitO71fKur3r+3ZaUokNVvWzkJIL
7aduxAJOV2hfLgEqbjbjF3o4S8p63OV3kzivFJM1And15zDJo4+qqOh67+bPo4jj
+R4du+SqzXriw4i3tDLGVpdjDffk4D41tbLzgkWAtvGyoP45yeYfHAzAh0pDFPRv
ASfZVmZ5PhwAUAkIMnpC2sjgmxMYff3SYqmDgnsqXES7rbDH/hG+teymtHFTyUQp
m+f60DNotSMcMvkLdvruLSB4aeTiwbfOqPn/g+aXYUlPuNMq1fVWgN7EJKWkamK4
nRwaJmLwx1/ojcVbpy2G
=YMKB
-----END PGP SIGNATURE-----
Merge tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC updates from Olof Johansson:
"This contains the bulk of new SoC development for this merge window.
Two new platforms have been added, the sunxi platforms (Allwinner A1x
SoCs) by Maxime Ripard, and a generic Broadcom platform for a new
series of ARMv7 platforms from them, where the hope is that we can
keep the platform code generic enough to have them all share one mach
directory. The new Broadcom platform is contributed by Christian
Daudt.
Highbank has grown support for Calxeda's next generation of hardware,
ECX-2000.
clps711x has seen a lot of cleanup from Alexander Shiyan, and he's
also taken on maintainership of the platform.
Beyond this there has been a bunch of work from a number of people on
converting more platforms to IRQ domains, pinctrl conversion, cleanup
and general feature enablement across most of the active platforms."
Fix up trivial conflicts as per Olof.
* tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (174 commits)
mfd: vexpress-sysreg: Remove LEDs code
irqchip: irq-sunxi: Add terminating entry for sunxi_irq_dt_ids
clocksource: sunxi_timer: Add terminating entry for sunxi_timer_dt_ids
irq: versatile: delete dangling variable
ARM: sunxi: add missing include for mdelay()
ARM: EXYNOS: Avoid early use of of_machine_is_compatible()
ARM: dts: add node for PL330 MDMA1 controller for exynos4
ARM: EXYNOS: Add support for secondary CPU bring-up on Exynos4412
ARM: EXYNOS: add UART3 to DEBUG_LL ports
ARM: S3C24XX: Add clkdev entry for camif-upll clock
ARM: SAMSUNG: Add s3c24xx/s3c64xx CAMIF GPIO setup helpers
ARM: sunxi: Add missing sun4i.dtsi file
pinctrl: samsung: Do not initialise statics to 0
ARM i.MX6: remove gate_mask from pllv3
ARM i.MX6: Fix ethernet PLL clocks
ARM i.MX6: rename PLLs according to datasheet
ARM i.MX6: Add pwm support
ARM i.MX51: Add pwm support
ARM i.MX53: Add pwm support
ARM: mx5: Replace clk_register_clkdev with clock DT lookup
...
Cleanup patches for various ARM platforms and some of their associated
drivers. There's also a branch in here that enables Freescale i.MX to be
part of the multiplatform support -- the first "big" SoC that is moved
over (more multiplatform work comes in a separate branch later during
the merge window).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQx2p9AAoJEIwa5zzehBx3aPUQAIjV3VDf/ACkA4KUQu0BFg5U
57OIkl6RCZvfKhYgq5+6OJ2AK6VkGh9PqTmXkDS7Nj3QMS/uWcb3U419aPJsd3Z/
vNGpTl+J/YcAcFrKMqTyNv98TAiAOJlpm70CqmRbkhpMfoJb7//1JKqGTJPBO+tj
8ZEwNGC0WbRNOSQTY/TTAhbZE1sqXwKy9mDLGmcwqKBY8H1TFHyPB6yWYFSxMHxS
JAegbYhYO9FawOOLoi9ovT+2vUR9vDu0xxV4zUK9f5DqKcCb/wYuN0QkusjnEutm
RfIt7iXHHzi35YPxtlrGgSz9EIYXKAafSzkgf3Ydpjci5DH/vbVexm/CT+V+SwOT
SvucYJMALI/aOEFJWN/50L6B9zipSrWb51tK7WFXz/sUCrMQrXH3Mu99mjHZXSoL
1cylsvs3DFQC7vHFLSjRpX6eJdfE+Hb0LZ878eXSbDVCOnU8odAQrofugqfmeVDk
eN0+BWmchJgvljOiKVUQMC3PCquCaAAO1lm/HU7bWPlVigTuHSW0uisDyCYAtlt1
dGxnbbhoFJvSH7CMOoMO7hIFnoNJEe6+uVUuwA/+iJouMXMJLoY6Da4L72h1Mp81
o4Hr6Kxly/SMtURZ/6pCycx5ahb5TaahstYAoe7Qp1dMj5U2m6fUVfKkG7tUx1CW
MIuvN3qJeW2qKWKmZRVM
=zfPd
-----END PGP SIGNATURE-----
Merge tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC cleanups on various subarchitectures from Olof Johansson:
"Cleanup patches for various ARM platforms and some of their associated
drivers. There's also a branch in here that enables Freescale i.MX to
be part of the multiplatform support -- the first "big" SoC that is
moved over (more multiplatform work comes in a separate branch later
during the merge window)."
Conflicts fixed as per Olof, including a silent semantic one in
arch/arm/mach-omap2/board-generic.c (omap_prcm_restart() was renamed to
omap3xxx_restart(), and a new user of the old name was added).
* tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (189 commits)
ARM: omap: fix typo on timer cleanup
ARM: EXYNOS: Remove unused regs-mem.h file
ARM: EXYNOS: Remove unused non-dt support for dwmci controller
ARM: Kirkwood: Use hw_pci.ops instead of hw_pci.scan
ARM: OMAP3: cm-t3517: use GPTIMER for system clock
ARM: OMAP2+: timer: remove CONFIG_OMAP_32K_TIMER
ARM: SAMSUNG: use devm_ functions for ADC driver
ARM: EXYNOS: no duplicate mask/unmask in eint0_15
ARM: S3C24XX: SPI clock channel setup is fixed for S3C2443
ARM: EXYNOS: Remove i2c0 resource information and setting of device names
ARM: Kirkwood: checkpatch cleanups
ARM: Kirkwood: Fix sparse warnings.
ARM: Kirkwood: Remove unused includes
ARM: kirkwood: cleanup lsxl board includes
ARM: integrator: use BUG_ON where possible
ARM: integrator: push down SC dependencies
ARM: integrator: delete static UART1 mapping
ARM: integrator: delete SC mapping on the CP
ARM: integrator: remove static CP syscon mapping
ARM: integrator: remove static AP syscon mapping
...
This is a collection of header file cleanups, mostly for OMAP and AT91,
that keeps moving the platforms in the direction of multiplatform by
removing the need for mach-dependent header files used in drivers and
other places.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQx2reAAoJEIwa5zzehBx3Z/oQAIHp8aXbUD0GvbBPQ2vIydZ8
Qfdh2ypbyFB9GSLmPP6BhurEFUcwmW3MH2r9Kq68omPt4XxrFEt1SdEn3TgZlhTY
YYzEv2DLbWfmKCDBFAEFullsfVkrpggyou5ty30YVp/u2lVLvChpkciE5/9HADy5
uoh/W2KZcLE51tvnbRy/1DBaOtxEdCjJ9hTaef7FCB9ugOG+yMDYwIPRW/b4fNA6
o/1NuZTWp0H4ePRG2oqLxYnjSiF63DVTJZjSJ+c5gW34bWgh8+/xzaLFA9U++/ig
meGUD1Oe65+ctr+Wk2mV29eb02jauUe6qvkwt+iFIDDopgc2mn5BcW+ENpgpjhe3
6l3ClRd94qh4cMWzqDa5PZCdetshiKqSH2IGpKS/dHNB1h5sBP7CCbvbalwzWd5n
qjfkPC7kSFyU3XV+9SEAXE6HLKsiMQy9kRcKOMdTE5BA4FEAZk0wfiG9WcVCvS2D
9tDC3X0aScyXx4Mkd4wIAZVhY68QuI17fPft9zZSj691YiUW5cR2yF1qbCka0yd5
pHu/QSZg1+dUitMUvMPQmWJ15jnAtEYUtV/8zQFFVchgkxlrW+UtvW3zxghB6D+J
ZwcjAMfQQz9fLoMQXlVovjQIhYsjw3SlV62ReBH7eyPs3+KfPIBn7Ks6mVQs3dS5
AMUWORsB5u92XHl9EL9C
=aggA
-----END PGP SIGNATURE-----
Merge tag 'headers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC Header cleanups from Olof Johansson:
"This is a collection of header file cleanups, mostly for OMAP and
AT91, that keeps moving the platforms in the direction of
multiplatform by removing the need for mach-dependent header files
used in drivers and other places."
Fix up mostly trivial conflicts as per Olof.
* tag 'headers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (106 commits)
ARM: OMAP2+: Move iommu/iovmm headers to platform_data
ARM: OMAP2+: Make some definitions local
ARM: OMAP2+: Move iommu2 to drivers/iommu/omap-iommu2.c
ARM: OMAP2+: Move plat/iovmm.h to include/linux/omap-iommu.h
ARM: OMAP2+: Move iopgtable header to drivers/iommu/
ARM: OMAP: Merge iommu2.h into iommu.h
atmel: move ATMEL_MAX_UART to platform_data/atmel.h
ARM: OMAP: Remove omap_init_consistent_dma_size()
arm: at91: move at91rm9200 rtc header in drivers/rtc
arm: at91: move reset controller header to arm/arm/mach-at91
arm: at91: move pit define to the driver
arm: at91: move at91_shdwc.h to arch/arm/mach-at91
arm: at91: move board header to arch/arm/mach-at91
arn: at91: move at91_tc.h to arch/arm/mach-at91
arm: at91 move at91_aic.h to arch/arm/mach-at91
arm: at91 move board.h to arch/arm/mach-at91
arm: at91: move platfarm_data to include/linux/platform_data/atmel.h
arm: at91: drop machine defconfig
ARM: OMAP: Remove NEED_MACH_GPIO_H
ARM: OMAP: Remove unnecessary mach and plat includes
...
Simple bug fixes that were not considered important enough for inclusion
into 3.7, especially those that arrived late during the merge window.
There's also a MAINTAINERS update for the Renesas platforms in here,
marking Simon Horman as a maintainer and changing the git url to his tree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQx2n/AAoJEIwa5zzehBx3ciUQAKIlI8FQ/ggsfP3xYzT7y9dm
WCmfWb5OMq/kO+7vNImiKpqYUW63C+/LZlsLvXreo0YaOcodnIGr6zR822POCZ7d
76sG8isrnHL9nNfBF3qRLsNfrFV+CrBKvm1go0v6cVVp6VfpHwAk5wbNL+ET/3Kb
l3rjpFhxd5m2CMLY3ejPpwxsYXB8RJTTwz2ANYH37+aO4JwJFOmnEZoDFSSnbF5F
veNF1PQvvmiFfi18yOIDJLzF1M1tw4g8Qr5Ypno2sBQaz3lm73cBtf5hqT7XjYzk
oYgbNbR+XuJD35jtyYtV2DzUEPqZvFFTi1AL31senG+vT0HlLj+7ykQoaWs5maAb
p9Vp3pe5kpCmQgRmeVF6Pqe5aTPhbfBnfIp+mBh5Bs9CUsXsSt+X8JYpicA3lZAM
3OWq2PTkK/Cvv7RW91H2njgfoIeM5YzR3bb3EZ7JyYdDO9U+o6CMxhhNimFKq9Zk
QqbFbLPhhBrhtcpCQTp0i17dvFJkHUY3CuktcgekEx1cTCdcaisjz/5Ax+3ZOG0C
WxDMyAwQ4DP+USc9LAQkBqw3uVcoM0RqlmrXvxRtkrLEsD4msnmKoOe8zm3Q/CoD
ZWASjBMbKaLzQ7rbrmV8CdjwdAwrxexyDITRd0yKgW4IeXALLBqyLb3cqt23C6Bf
qIZtOgu1DKm1QoBM8Tgm
=Lcpo
-----END PGP SIGNATURE-----
Merge tag 'fixes-non-critical' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC Non-critical bug fixes from Olof Johansson:
"Simple bug fixes that were not considered important enough for
inclusion into 3.7, especially those that arrived late during the
merge window.
There's also a MAINTAINERS update for the Renesas platforms in here,
marking Simon Horman as a maintainer and changing the git url to his
tree."
* tag 'fixes-non-critical' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
Update ARM/SHMOBILE section of MAINTAINERS
ARM: Fix Kconfig symbols typo for LEDS
ARM: pxa: add dummy SA1100 rtc clock in pxa25x
ARM: pxa: fix pxa25x gpio wakeup setting
ARM: OMAP4: PM: fix errata handling when CONFIG_PM=n
ARM: cns3xxx: drop unnecessary symbol selection
ARM: vexpress: fix ll debug code when building multiplatform
ARM: OMAP4: retrigger localtimers after re-enabling gic
ARM: OMAP4460: Workaround for ROM bug because of CA9 r2pX GIC control register change.
ARM: OMAP4: PM: add errata support
ARM: davinci: fix return value check by using IS_ERR in tnetv107x_devices_init()
ARM: davinci: uncompress.h: bail out if uart not initialized
ARM: davinci: serial.h: fix uart number in the comment
ARM: davinci: dm644x evm: move pointer dereference below NULL check
ARM: vexpress: Make the debug UART detection more specific
Pull ARM updates from Russell King:
"Here's the updates for ARM for this merge window, which cover quite a
variety of areas.
There's a bunch of patch series from Will tackling various bugs like
the PROT_NONE handling, ASID allocation, cluster boot protocol and
ASID TLB tagging updates.
We move to a build-time sorted exception table rather than doing the
sorting at run-time, add support for the secure computing filter, and
some updates to the perf code. We also have sorted out the placement
of some headers, fixed some build warnings, fixed some hotplug
problems with the per-cpu TWD code."
* 'for-linus' of git://git.linaro.org/people/rmk/linux-arm: (73 commits)
ARM: 7594/1: Add .smp entry for REALVIEW_EB
ARM: 7599/1: head: Remove boot-time HYP mode check for v5 and below
ARM: 7598/1: net: bpf_jit_32: fix sp-relative load/stores offsets.
ARM: 7595/1: syscall: rework ordering in syscall_trace_exit
ARM: 7596/1: mmci: replace readsl/writesl with ioread32_rep/iowrite32_rep
ARM: 7597/1: net: bpf_jit_32: fix kzalloc gfp/size mismatch.
ARM: 7593/1: nommu: do not enable DCACHE_WORD_ACCESS when !CONFIG_MMU
ARM: 7592/1: nommu: prevent generation of kernel unaligned memory accesses
ARM: 7591/1: nommu: Enable the strict alignment (CR_A) bit only if ARCH < v6
ARM: 7590/1: /proc/interrupts: limit the display of IPIs to online CPUs only
ARM: 7587/1: implement optimized percpu variable access
ARM: 7589/1: integrator: pass the lm resource to amba
ARM: 7588/1: amba: create a resource parent registrator
ARM: 7582/2: rename kvm_seq to vmalloc_seq so to avoid confusion with KVM
ARM: 7585/1: kernel: fix nr_cpu_ids check in DT logical map init
ARM: 7584/1: perf: fix link error when CONFIG_HW_PERF_EVENTS is not selected
ARM: gic: use a private mapping for CPU target interfaces
ARM: kernel: add logical mappings look-up
ARM: kernel: add cpu logical map DT init in setup_arch
ARM: kernel: add device tree init map function
...
Implement destination MAC rule extension for L3/L4 rules in
flow steering. Usefull for vSwitch/macvlan configurations.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of full_mac, zero_mac in favour of
is_zero_ether_addr and is_broadcast_ether_addr.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ability to specify destination MAC address for L3/L4 flow spec
in order to be able to specify action for different VM's under vSwitch
configuration. This change is transparent to older userspace.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implents adding/deleting mdb entries via netlink.
Currently all entries are temp, we probably need a flag to distinguish
permanent entries too.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>