There are many locations that do
if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);
but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To properly handle fsync/msync in an efficient way DAX needs to track
dirty pages so it is able to flush them durably to media on demand.
The tracking of dirty pages is done via the radix tree in struct
address_space. This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file,
and it already has support for exceptional (non struct page*) entries.
We build upon these features to add exceptional entries to the radix
tree for DAX dirty PMD or PTE pages at fault time.
[dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this
function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for tracking dirty DAX entries in the struct address_space
radix tree. This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.
In order to properly track dirty DAX pages we will insert new
exceptional entries into the radix tree that represent dirty DAX PTE or
PMD pages. These exceptional entries will also contain the writeback
addresses for the PTE or PMD faults that we can use at fsync/msync time.
There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third. We rely
on the fact that only one type of exceptional entry can be found in a
given radix tree based on its usage. This happens for free with DAX vs
shmem but we explicitly prevent shadow entries from being added to radix
trees for DAX mappings.
The only shadow entries that would be generated for DAX radix trees
would be to track zero page mappings that were created for holes. These
pages would receive minimal benefit from having shadow entries, and the
choice to have only one type of exceptional entry in a given radix tree
makes the logic simpler both in clear_exceptional_entry() and in the
rest of DAX.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tetsuo Handa reported underflow of NR_MLOCK on munlock.
Testcase:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#define BASE ((void *)0x400000000000)
#define SIZE (1UL << 21)
int main(int argc, char *argv[])
{
void *addr;
system("grep Mlocked /proc/meminfo");
addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
-1, 0);
if (addr == MAP_FAILED)
printf("mmap() failed\n"), exit(1);
munmap(addr, SIZE);
system("grep Mlocked /proc/meminfo");
return 0;
}
It happens on munlock_vma_page() due to unfortunate choice of nr_pages
data type:
__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
For unsigned int nr_pages, implicitly casted to long in
__mod_zone_page_state(), it becomes something around UINT_MAX.
munlock_vma_page() usually called for THP as small pages go though
pagevec.
Let's make nr_pages signed int.
Similar fixes in 6cdb18ad98 ("mm/vmstat: fix overflow in
mod_zone_page_state()") used `long' type, but `int' here is OK for a
count of the number of sub-pages in a huge page.
Fixes: ff6a6da60b ("mm: accelerate munlock() treatment of THP pages")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michel Lespinasse <walken@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
ptl doesn't make much sense in this case.
Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide statistics on how much of a cgroup's memory footprint is made up
of socket buffers from network connections owned by the group.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a cgroup2 memory.stat that provides statistics on LRU memory
and fault event counters. More consumers and breakdowns will follow.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changing page->mem_cgroup of a live page is tricky and fragile. In
particular, the memcg writeback code relies on that mapping being stable
and users of mem_cgroup_replace_page() not overlapping with dirtyable
inodes.
Page cache replacement doesn't have to do that, though. Instead of being
clever and transferring the charge from the old page to the new,
force-charge the new page and leave the old page alone. A temporary
overcharge won't matter in practice, and the old page is going to be freed
shortly after this anyway. And this is not performance critical.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap cache pages are freed aggressively if swap is nearly full (>50%
currently), because otherwise we are likely to stop scanning anonymous
when we near the swap limit even if there is plenty of freeable swap cache
pages. We should follow the same trend in case of memory cgroup, which
has its own swap limit.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't scan anonymous memory if we ran out of swap, neither should we do
it in case memcg swap limit is hit, because swap out is impossible anyway.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_lruvec_online() takes lruvec, but it only needs memcg. Since
get_scan_count(), which is the only user of this function, now possesses
pointer to memcg, let's pass memcg directly to mem_cgroup_online() instead
of picking it out of lruvec and rename the function accordingly.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg will come in handy in get_scan_count(). It can already be used for
getting swappiness immediately in get_scan_count() instead of passing it
around. The following patches will add more memcg-related values, which
will be used there.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset introduces swap accounting to cgroup2.
This patch (of 7):
In the legacy hierarchy we charge memsw, which is dubious, because:
- memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.
- memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.
That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual number
of swap entries used by a cgroup. It is only charged in the unified
hierarchy, while the legacy hierarchy memsw logic is left intact.
The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.
Note, to charge swap resource properly in the unified hierarchy, we have
to make swap_entry_free uncharge swap only when ->usage reaches zero, not
just ->count, i.e. when all references to a swap entry, including the one
taken by swap cache, are gone. This is necessary, because otherwise
swap-in could result in uncharging swap even if the page is still in swap
cache and hence still occupies a swap entry. At the same time, this
shouldn't break memsw counter logic, where a page is never charged twice
for using both memory and swap, because in case of legacy hierarchy we
uncharge swap on commit (see mem_cgroup_commit_charge).
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The creation and teardown of struct mem_cgroup is fairly messy and
that has attracted mistakes and subtle bugs before.
The main cause for this is that there is no clear model about what
needs to happen when, and that attracts more chaos. So create one:
1. mem_cgroup_alloc() should allocate struct mem_cgroup and its
auxiliary members and initialize work items, locks etc. so that the
object it returns is fully initialized and in a neutral state.
2. mem_cgroup_css_alloc() will use mem_cgroup_alloc() to obtain a new
memcg object and configure it and the system according to the role
of the new memory-controlled cgroup in the hierarchy.
3. mem_cgroup_css_online() is no longer needed to synchronize with
iterators, but it verifies css->id which isn't available earlier.
4. mem_cgroup_css_offline() implements stuff that needs to happen upon
the user-visible destruction of a cgroup, which includes stopping
all user interfacing as well as releasing certain structures when
continued memory consumption would be unexpected at that point.
5. mem_cgroup_css_free() prepares the system and the memcg object for
the object's disappearance, neutralizes its state, and then gives
it back to mem_cgroup_free().
6. mem_cgroup_free() releases struct mem_cgroup and auxiliary memory.
[arnd@arndb.de: fix SLOB build regression]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are no more external users of struct cg_proto, flatten the
structure into struct mem_cgroup.
Since using those struct members doesn't stand out as much anymore,
add cgroup2 static branches to make it clearer which code is legacy.
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
controller code is insignificant, having these conditionals is not
worth the complication and fragility that comes with them.
[akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
network subsys. Let's move it to memcontrol.c. This also allows us to
reuse generic code for handling legacy memcg files.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
interface. This also makes legacy-only code sections stand out better.
[arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem accounting might incur overhead that some users can't put up with.
Besides, the implementation is still considered unstable. So let's
provide a way to disable it for those users who aren't happy with it.
To disable kmem accounting for cgroup2, pass cgroup.memory=nokmem at
boot time.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original cgroup memory controller has an extension to account slab
memory (and other "kernel memory" consumers) in a separate "kmem"
counter, once the user set an explicit limit on that "kmem" pool.
However, this includes various consumers whose sizes are directly linked
to userspace activity. Accounting them as an optional "kmem" extension
is problematic for several reasons:
1. It leaves the main memory interface with incomplete semantics. A
user who puts their workload into a cgroup and configures a memory
limit does not expect us to leave holes in the containment as big
as the dentry and inode cache, or the kernel stack pages.
2. If the limit set on this random historical subgroup of consumers is
reached, subsequent allocations will fail even when the main memory
pool available to the cgroup is not yet exhausted and/or has
reclaimable memory in it.
3. Calling it 'kernel memory' is misleading. The dentry and inode
caches are no more 'kernel' (or no less 'user') memory than the
page cache itself. Treating these consumers as different classes is
a historical implementation detail that should not leak to users.
So, in addition to page cache, anonymous memory, and network socket
memory, account the following memory consumers per default in the
cgroup2 memory controller:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems.
This should give us reasonable memory isolation for most common
workloads out of the box.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cgroup2 memory controller will account important in-kernel memory
consumers per default. Move all necessary components to CONFIG_MEMCG.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cgroup2 memory controller will include important in-kernel memory
consumers per default, including socket memory, but it will no longer
carry the historic tcp control interface.
Separate the kmem state init from the tcp control interface init in
preparation for that.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Put all the related code to setup and teardown the kmem accounting state
into the same location. No functional change intended.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On any given memcg, the kmem accounting feature has three separate
states: not initialized, structures allocated, and actively accounting
slab memory. These are represented through a combination of the
kmem_acct_activated and kmem_acct_active flags, which is confusing.
Convert to a kmem_state enum with the states NONE, ALLOCATED, and
ONLINE. Then rename the functions to modify the state accordingly.
This follows the nomenclature of css object states more closely.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmem page_counter's limit is initialized to PAGE_COUNTER_MAX inside
mem_cgroup_css_online(). There is no need to repeat this from
memcg_propagate_kmem().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series adds accounting of the historical "kmem" memory consumers to
the cgroup2 memory controller.
These consumers include the dentry cache, the inode cache, kernel stack
pages, and a few others that are pointed out in patch 7/8. The
footprint of these consumers is directly tied to userspace activity in
common workloads, and so they have to be part of the minimally viable
configuration in order to present a complete feature to our users.
The cgroup2 interface of the memory controller is far from complete, but
this series, along with the socket memory accounting series, provides
the final semantic changes for the existing memory knobs in the cgroup2
interface, which is scheduled for initial release in the next merge
window.
This patch (of 8):
Remove unused css argument frmo memcg_init_kmem()
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only functions doing more than one read are modified. Consumeres
happened to deal with possibly changing data, but it does not seem like
a good thing to rely on.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UBSAN uses compile-time instrumentation to catch undefined behavior
(UB). Compiler inserts code that perform certain kinds of checks before
operations that could cause UB. If check fails (i.e. UB detected)
__ubsan_handle_* function called to print error message.
So the most of the work is done by compiler. This patch just implements
ubsan handlers printing errors.
GCC has this capability since 4.9.x [1] (see -fsanitize=undefined
option and its suboptions).
However GCC 5.x has more checkers implemented [2].
Article [3] has a bit more details about UBSAN in the GCC.
[1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
[2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
[3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/
Issues which UBSAN has found thus far are:
Found bugs:
* out-of-bounds access - 97840cb67f ("netfilter: nfnetlink: fix
insufficient validation in nfnetlink_bind")
undefined shifts:
* d48458d4a7 ("jbd2: use a better hash function for the revoke
table")
* 10632008b9 ("clockevents: Prevent shift out of bounds")
* 'x << -1' shift in ext4 -
http://lkml.kernel.org/r/<5444EF21.8020501@samsung.com>
* undefined rol32(0) -
http://lkml.kernel.org/r/<1449198241-20654-1-git-send-email-sasha.levin@oracle.com>
* undefined dirty_ratelimit calculation -
http://lkml.kernel.org/r/<566594E2.3050306@odin.com>
* undefined roundown_pow_of_two(0) -
http://lkml.kernel.org/r/<1449156616-11474-1-git-send-email-sasha.levin@oracle.com>
* [WONTFIX] undefined shift in __bpf_prog_run -
http://lkml.kernel.org/r/<CACT4Y+ZxoR3UjLgcNdUm4fECLMx2VdtfrENMtRRCdgHB2n0bJA@mail.gmail.com>
WONTFIX here because it should be fixed in bpf program, not in kernel.
signed overflows:
* 32a8df4e0b ("sched: Fix odd values in effective_load()
calculations")
* mul overflow in ntp -
http://lkml.kernel.org/r/<1449175608-1146-1-git-send-email-sasha.levin@oracle.com>
* incorrect conversion into rtc_time in rtc_time64_to_tm() -
http://lkml.kernel.org/r/<1449187944-11730-1-git-send-email-sasha.levin@oracle.com>
* unvalidated timespec in io_getevents() -
http://lkml.kernel.org/r/<CACT4Y+bBxVYLQ6LtOKrKtnLthqLHcw-BMp3aqP3mjdAvr9FULQ@mail.gmail.com>
* [NOTABUG] signed overflow in ktime_add_safe() -
http://lkml.kernel.org/r/<CACT4Y+aJ4muRnWxsUe1CMnA6P8nooO33kwG-c8YZg=0Xc8rJqw@mail.gmail.com>
[akpm@linux-foundation.org: fix unused local warning]
[akpm@linux-foundation.org: fix __int128 build woes]
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yury Gribov <y.gribov@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
record_obj() in migrate_zspage() does not preserve handle's
HANDLE_PIN_BIT, set by find_aloced_obj()->trypin_tag(), and implicitly
(accidentally) un-pins the handle, while migrate_zspage() still performs
an explicit unpin_tag() on the that handle. This additional explicit
unpin_tag() introduces a race condition with zs_free(), which can pin
that handle by this time, so the handle becomes un-pinned.
Schematically, it goes like this:
CPU0 CPU1
migrate_zspage
find_alloced_obj
trypin_tag
set HANDLE_PIN_BIT zs_free()
pin_tag()
obj_malloc() -- new object, no tag
record_obj() -- remove HANDLE_PIN_BIT set HANDLE_PIN_BIT
unpin_tag() -- remove zs_free's HANDLE_PIN_BIT
The race condition may result in a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
PC is at get_zspage_mapping+0x0/0x24
LR is at obj_free.isra.22+0x64/0x128
Call trace:
get_zspage_mapping+0x0/0x24
zs_free+0x88/0x114
zram_free_page+0x64/0xcc
zram_slot_free_notify+0x90/0x108
swap_entry_free+0x278/0x294
free_swap_and_cache+0x38/0x11c
unmap_single_vma+0x480/0x5c8
unmap_vmas+0x44/0x60
exit_mmap+0x50/0x110
mmput+0x58/0xe0
do_exit+0x320/0x8dc
do_group_exit+0x44/0xa8
get_signal+0x538/0x580
do_signal+0x98/0x4b8
do_notify_resume+0x14/0x5c
This patch keeps the lock bit in migration path and update value
atomically.
Signed-off-by: Junil Lee <junil0814.lee@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A newly added tracepoint in the hugepage code uses a variable in the
error handling that is not initialized at that point:
include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]
The result is relatively harmless, as the trace data will in rare
cases contain incorrect data.
This works around the problem by adding an explicit initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 7d2eba0557 ("mm: add tracepoint for scanning pages")
Reviewed-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a new kind of barrier, and reworks virtio and xen
to use it.
Plus some fixes here and there.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWlU2kAAoJECgfDbjSjVRpZ6IH/Ra19ecG8sCQo9zskr4zo22Z
DZXC3u0sJDBYjjBAiw3IY1FKh7wx2Fr1RhUOj1bteBgcFCMCV1zInP5ITiCyzd1H
YYh1w9C2tZaj2T4t9L4hIrAdtIF8fGS+oI2IojXPjOuDLEt6pfFBEjHp/sfl3UJq
ZmZvw4OXviSNej7jBw8Xni3Uv18yfmLGXvMdkvMSPC1/XL29voGDqTVwhqJwxLVz
k/ZLcKFOzIs9N7Nja0Jl1EiZtC2Y9cpItqweicNAzszlpkSL44vQxmCSefB+WyQ4
gt0O3+AxYkLfrxzCBhUA4IpRex3/XPW1b+1e/V1XjfR2n/FlyLe+AIa8uPJElFc=
=ukaV
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio barrier rework+fixes from Michael Tsirkin:
"This adds a new kind of barrier, and reworks virtio and xen to use it.
Plus some fixes here and there"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (44 commits)
checkpatch: add virt barriers
checkpatch: check for __smp outside barrier.h
checkpatch.pl: add missing memory barriers
virtio: make find_vqs() checkpatch.pl-friendly
virtio_balloon: fix race between migration and ballooning
virtio_balloon: fix race by fill and leak
s390: more efficient smp barriers
s390: use generic memory barriers
xen/events: use virt_xxx barriers
xen/io: use virt_xxx barriers
xenbus: use virt_xxx barriers
virtio_ring: use virt_store_mb
sh: move xchg_cmpxchg to a header by itself
sh: support 1 and 2 byte xchg
virtio_ring: update weak barriers to use virt_xxx
Revert "virtio_ring: Update weak barriers to use dma_wmb/rmb"
asm-generic: implement virt_xxx memory barriers
x86: define __smp_xxx
xtensa: define __smp_xxx
tile: define __smp_xxx
...
Commit b8d3c4c300 ("mm/huge_memory.c: don't split THP page when
MADV_FREE syscall is called") introduced this new function, but got the
error handling for when pmd_trans_huge_lock() fails wrong. In the
failure case, the lock has not been taken, and we should not unlock on
the way out.
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.
In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.
Fixed by only freeing after synchronize_rcu().
Signed-off-by: Martijn Coenen <maco@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently memory_failure() doesn't handle non anonymous thp case,
because we can hardly expect the error handling to be successful, and it
can just hit some corner case which results in BUG_ON or something
severe like that. This is also the case for soft offline code, so let's
make it in the same way.
Orignal code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
but it's unnecessary because get_any_page() is already called when
running on this code, which takes a refcount of the target page
regardress of the flag. So this patch also removes it.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
soft_offline_page() has some deeply indented code, that's the sign of
demand for cleanup. So let's do this. No functionality change.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both s390 and powerpc have hit the issue of swapoff hanging, when
CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
quite as x86_64 had them. I think it would be much clearer if
HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
determine whether the MEM_SOFT_DIRTY option should be offered, and the
actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.
But won't embark on that change myself: instead make swapoff more
robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That being a no-op,
whether the bit in question is defined as 0 or the asm-generic fallback
is used, unless soft dirty is fully turned on.
Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov has reported[1] possible deadlock (triggered by his
syzkaller fuzzer):
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&hugetlbfs_i_mmap_rwsem_key);
lock(&mapping->i_mmap_rwsem);
lock(&hugetlbfs_i_mmap_rwsem_key);
lock(&mapping->i_mmap_rwsem);
Both traces points to mm_take_all_locks() as a source of the problem.
It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
mapping->i_mmap_rwsem for hugetlb mapping) vs. i_mmap_rwsem.
huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
and allocator can take i_mmap_rwsem if it hit reclaim. So we need to
take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
rest of VMAs.
The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
[1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MPOL_MF_LAZY is not visible from userspace since a720094ded ("mm:
mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
VM_HUGETLB VMAs, and avoid useless overhead of minor faults.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unused struct zone *z variable which appeared in 86051ca5ea
("mm: fix usemap initialization").
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since can_do_mlock only return 1 or 0, so make it boolean.
No functional change.
[akpm@linux-foundation.org: update declaration in mm.h]
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just cleanup, no functional change.
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use "IS_ALIGNED" to judge the alignment, rather than directly judging.
Signed-off-by: Wang Xiaoqiang <wang_xiaoq@126.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.
The gmap code will call fixup_user_fault which will end up always in
handle_mm_fault. Till now we never cared about retries, but as the
userfaultfd code kind of relies on it. this needs some fix.
This patchset does not take care of the futex code. I will now look
closer at this.
This patch (of 2):
With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.
This patch brings in the logic to handle retries as well as it cleans up
the current documentation. fixup_user_fault was not having the same
semantics as filemap_fault. It never indicated if a retry happened and
so a caller wasn't able to handle that case. So we now changed the
behaviour to always retry a locked mmap_sem.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A dax-huge-page mapping while it uses some thp helpers is ultimately not
a transparent huge page. The distinction is especially important in the
get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge() which have slightly different
semantics.
Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.
[kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers
_PAGE_DEVMAP to be set in the resulting pte.
There are no functional changes to the gpu drivers as a result of this
conversion.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to this change DAX PMD mappings that were made read-only were
never able to be made writable again. This is because the code in
insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
these calls if the PMD already existed in the page table.
Instead, if we are doing a write always mark the PMD entry as dirty and
writeable. Without this code we can get into a condition where we mark
the PMD as read-only, and then on a subsequent write fault we get into
an infinite loop of PMD faults where we try unsuccessfully to make the
PMD writeable.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sasha Levin has reported KASAN out-of-bounds bug[1]. It points to "if
(!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.
The cause is that split_huge_page() doesn't handle THP correctly if it's
not allingned to PMD boundary. It can happen after mremap().
Test-case (not always triggers the bug):
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#define MB (1024UL*1024)
#define SIZE (2*MB)
#define BASE ((void *)0x400000000000)
int main()
{
char *p;
p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
-1, 0);
if (p == MAP_FAILED)
perror("mmap"), exit(1);
p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
BASE + SIZE + 8192);
if (p == MAP_FAILED)
perror("mremap"), exit(1);
system("echo 1 > /sys/kernel/debug/split_huge_pages");
return 0;
}
The patch fixes freeze and unfreeze paths to handle page table boundary
crossing.
It also makes mapcount vs count check in split_huge_page_to_list()
stricter:
- after freeze we don't expect any subpage mapped as we remove them
from rmap when setting up migration entries;
- count must be 1, meaning only caller has reference to the page;
[1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need to split THP page when MADV_FREE syscall is called if
[start, len] is aligned with THP size. The split could be done when VM
decide to free it in reclaim path if memory pressure is heavy. With
that, we could avoid unnecessary THP split.
For the feature, this patch changes pte dirtness marking logic of THP.
Now, it marks every ptes of pages dirty unconditionally in splitting,
which makes MADV_FREE void. So, instead, this patch propagates pmd
dirtiness to all pages via PG_dirty and restores pte dirtiness from
PG_dirty. With this, if pmd is clean(ie, MADV_FREEed) when split
happens(e,g, shrink_page_list), all of pages are clean too so we could
discard them.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but KSM
uses clean write-protected ptes to reference the stable ksm page. So be
sure to mark that page dirty, so it's never mistakenly discarded.
[hughd@google.com: adjusted comments]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
them so there is no value keeping them in the active anonymous LRU so
this patch moves them to inactive LRU list's head.
This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first because they are more likely to be cold rather
than recently active pages.
An arguable issue for the approach would be whether we should put the
page to the head or tail of the inactive list. I chose head because the
kernel cannot make sure it's really cold or warm for every MADV_FREE
usecase but at least we know it's not *hot*, so landing of inactive head
would be a comprimise for various usecases.
This fixes suboptimal behavior of MADV_FREE when pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily. This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
is might not be used.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is
siginficat slower (ie, 2x times) than madvise_dontneed.
loop = 5;
mmap(512M);
while (loop--) {
memset(512M);
madvise(MADV_FREE or MADV_DONTNEED);
}
The reason is lots of swapin.
1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin
If we find hinted pages were already swapped out when syscall is called,
it's pointless to keep the swapped-out pages in pte. Instead, let's
free the cold page because swapin is more expensive than (alloc page +
zeroing).
With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time
1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).
The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.
Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation +
zeroing).
Jason Evans said:
: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.
How it works:
When madvise syscall is called, VM clears dirty bit of ptes of the
range. If memory pressure happens, VM checks dirty bit of page table
and if it found still "clean", it means it's a "lazyfree pages" so VM
could discard the page instead of swapping out. Once there was store
operation for the page before VM peek a page to reclaim, dirty bit is
set so VM can swap out the page instead of discarding.
One thing we should notice is that basically, MADV_FREE relies on dirty
bit in page table entry to decide whether VM allows to discard the page
or not. IOW, if page table entry includes marked dirty bit, VM
shouldn't discard the page.
However, as a example, if swap-in by read fault happens, page table
entry doesn't have dirty bit so MADV_FREE could discard the page
wrongly.
For avoiding the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache. It worked out because swapped-in page lives on swap
cache and since it is evicted from the swap cache, the page has PG_dirty
flag. So both page flags check effectively prevent wrong discarding by
MADV_FREE.
However, a problem in above logic is that swapped-in page has PG_dirty
still after they are removed from swap cache so VM cannot consider the
page as freeable any more even if madvise_free is called in future.
Look at below example for detail.
ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of pages are swapped out
..
..
var = *ptr; -> a page swapped-in and could be removed from
swapcache. Then, page table doesn't mark
dirty bit and page descriptor includes PG_dirty
..
..
madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
..
..
..
.. heavy memory pressure again.
.. In this time, VM cannot discard the page because the page
.. has *PG_dirty*
To solve the problem, this patch clears PG_dirty if only the page is
owned exclusively by current process when madvise is called because
PG_dirty represents ptes's dirtiness in several processes so we could
clear it only if we own it exclusively.
Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)
barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark(./ebizzy -S 10 -n 512)
Higher avg is better.
vanilla-jemalloc MADV_free-jemalloc
1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00
2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00
4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00
8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00
16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00
32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00
In summary, MADV_FREE is about much faster than MADV_DONTNEED.
This patch (of 12):
Add core MADV_FREE implementation.
[akpm@linux-foundation.org: small cleanups]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Shaohua Li" <shli@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
code for looking up pte of a (possibly transhuge) page. Move this code
to a new helper function, page_check_address_transhuge(), and make the
above mentioned functions use it.
This is just a cleanup, no functional changes are intended.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During freeze_page(), we remove the page from rmap. It munlocks the
page if it was mlocked. clear_page_mlock() uses thelru cache, which
temporary pins the page.
Let's drain the lru cache before checking page's count vs. mapcount.
The change makes mlocked page split on first attempt, if it was not
pinned by somebody else.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Writing 1 into 'split_huge_pages' will try to find and split all huge
pages in the system. This is useful for debuging.
[akpm@linux-foundation.org: fix printk text, per Vlastimil]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with PMD, so there's no reason to look on PTEs
for PageTransHuge() pages. That's no true anymore: THP can be mapped
with PTEs too.
The patch removes PageTransHuge() test from the functions and opencode
page table check.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before THP refcounting rework, THP was not allowed to cross VMA
boundary. So, if we have THP and we split it, PG_mlocked can be safely
transferred to small pages.
With new THP refcounting and naive approach to mlocking we can end up
with this scenario:
1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
2. the process does munlock() on the *part* of the THP:
- the VMA is split into two, one of them VM_LOCKED;
- huge PMD split into PTE table;
- THP is still mlocked;
3. split_huge_page():
- it transfers PG_mlocked to *all* small pages regrardless if it
blong to any VM_LOCKED VMA.
We probably could munlock() all small pages on split_huge_page(), but I
think we have accounting issue already on step two.
Instead of forbidding mlocked pages altogether, we just avoid mlocking
PTE-mapped THPs and munlock THPs on split_huge_pmd().
This means PTE-mapped THPs will be on normal lru lists and will be split
under memory pressure by vmscan. After the split vmscan will detect
unevictable small pages and mlock them.
With this approach we shouldn't hit situation like described above.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All parts of THP with new refcounting are now in place. We can now
allow to enable THP.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we don't split huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.
Furtunately, we can detect partial unmap on page_remove_rmap(). But we
cannot call split_huge_page() from there due to locking context.
It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.
The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting. The splitting itself will happen when we get
memory pressure via shrinker interface. The page will be dropped from
list on freeing through compound page destructor.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are not able to migrate THPs. It means it's not enough to split only
PMD on migration -- we need to split compound page under it too.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds implementation of split_huge_page() for new
refcountings.
Unlike previous implementation, new split_huge_page() can fail if
somebody holds GUP pin on the page. It also means that pin on page
would prevent it from bening split under you. It makes situation in
many places much cleaner.
The basic scheme of split_huge_page():
- Check that sum of mapcounts of all subpage is equal to page_count()
plus one (caller pin). Foll off with -EBUSY. This way we can avoid
useless PMD-splits.
- Freeze the page counters by splitting all PMD and setup migration
PTEs.
- Re-check sum of mapcounts against page_count(). Page's counts are
stable now. -EBUSY if page is pinned.
- Split compound page.
- Unfreeze the page by removing migration entries.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
changes in thp refcounting rule. This patch fixes them up.
In the new refcounting, we no longer use tail->_mapcount to keep tail's
refcount, and thereby we can simplify get/put_hwpoison_page().
And another change is that tail's refcount is not transferred to the raw
page during thp split (more precisely, in new rule we don't take
refcount on tail page any more.) So when we need thp split, we have to
transfer the refcount properly to the 4kB soft-offlined page before
migration.
thp split code goes into core code only when precheck
(total_mapcount(head) == page_count(head) - 1) passes to avoid useless
split, where we assume that one refcount is held by the caller of thp
split and the others are taken via mapping. To meet this assumption,
this patch moves thp split part in soft_offline_page() after
get_any_page().
[akpm@linux-foundation.org: remove unneeded #define, per Kirill]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I saw the following BUG_ON triggered in a testcase where a process calls
madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
calls migratepages command repeatedly (doing ping-pong among different
NUMA nodes) for the first process:
Soft offlining page 0x60000 at 0x700000600000
__get_any_page: 0x60000 free buddy page
page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1
flags: 0x1fffc0000000000()
page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
------------[ cut here ]------------
kernel BUG at /src/linux-dev/include/linux/mm.h:342!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
RIP: 0010:[<ffffffff8118998c>] [<ffffffff8118998c>] put_page+0x5c/0x60
RSP: 0018:ffff88007c213e00 EFLAGS: 00010246
Call Trace:
put_hwpoison_page+0x4e/0x80
soft_offline_page+0x501/0x520
SyS_madvise+0x6bc/0x6f0
entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
RIP [<ffffffff8118998c>] put_page+0x5c/0x60
RSP <ffff88007c213e00>
The root cause resides in get_any_page() which retries to get a refcount
of the page to be soft-offlined. This function calls
put_hwpoison_page(), expecting that the target page is putback to LRU
list. But it can be also freed to buddy. So the second check need to
care about such case.
Fixes: af8fae7c08 ("mm/memory-failure.c: clean up soft_offline_page()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setup and remove migration entries require
page to be locked.
Some of split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to use migration PTE entries to stabilize page counts. If
the page is mapped with PMDs we need to split the PMD and setup
migration entries. It's reasonable to combine these operations to avoid
double-scanning over the page table.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Original split_huge_page() combined two operations: splitting PMDs into
tables of PTEs and splitting underlying compound page. This patch
implements split_huge_pmd() which split given PMD without splitting
other PMDs this page mapped with or underlying compound page.
Without tail page refcounting, implementation of split_huge_pmd() is
pretty straight-forward.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's define page_mapped() to be true for compound pages if any
sub-pages of the compound page is mapped (with PMD or PTE).
On other hand page_mapcount() return mapcount for this particular small
page.
This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTE.
Most users outside core-mm should use page_mapcount() instead of
page_mapped().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to allow mapping of individual 4k pages of THP compound. It
means we need to track mapcount on per small page basis.
Straight-forward approach is to use ->_mapcount in all subpages to track
how many time this subpage is mapped with PMDs or PTEs combined. But
this is rather expensive: mapping or unmapping of a THP page with PMD
would require HPAGE_PMD_NR atomic operations instead of single we have
now.
The idea is to store separately how many times the page was mapped as
whole -- compound_mapcount. This frees up ->_mapcount in subpages to
track PTE mapcount.
We use the same approach as with compound page destructor and compound
order to store compound_mapcount: use space in first tail page,
->mapping this time.
Any time we map/unmap whole compound page (THP or hugetlb) -- we
increment/decrement compound_mapcount. When we map part of compound
page with PTE we operate on ->_mapcount of the subpage.
page_mapcount() counts both: PTE and PMD mappings of the page.
Basically, we have mapcount for a subpage spread over two counters. It
makes tricky to detect when last mapcount for a page goes away.
We introduced PageDoubleMap() for this. When we split THP PMD for the
first time and there's other PMD mapping left we offset up ->_mapcount
in all subpages by one and set PG_double_map on the compound page.
These additional references go away with last compound_mapcount.
This approach provides a way to detect when last mapcount goes away on
per small page basis without introducing new overhead for most common
cases.
[akpm@linux-foundation.org: fix typo in comment]
[mhocko@suse.com: ignore partial THP when moving task]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to use migration entries to stabilize page counts. It
means we don't need compound_lock() for that.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need special code to stabilize THP. If you've got reference to
any subpage of THP it will not be split under you.
New split_huge_page() also accepts tail pages: no need in special code
to get reference to head page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tail page refcounting is utterly complicated and painful to support.
It uses ->_mapcount on tail pages to store how many times this page is
pinned. get_page() bumps ->_mapcount on tail page in addition to
->_count on head. This information is required by split_huge_page() to
be able to distribute pins from head of compound page to tails during
the split.
We will need ->_mapcount to account PTE mappings of subpages of the
compound page. We eliminate need in current meaning of ->_mapcount in
tail pages by forbidding split entirely if the page is pinned.
The only user of tail page refcounting is THP which is marked BROKEN for
now.
Let's drop all this mess. It makes get_page() and put_page() much
simpler.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We will re-introduce new version with new refcounting later in patchset.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Up to this point we tried to keep patchset bisectable, but next patches
are going to change how core of THP refcounting work.
It would be beneficial to split the change into several patches and make
it more reviewable. Unfortunately, I don't see how we can achieve that
while keeping THP working.
Let's hide THP under CONFIG_BROKEN for now and bring it back when new
refcounting get established.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch replaces THP_SPLIT with tree events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
are going to be able split PMD without the compound page and that
split_huge_page() can fail.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepare khugepaged to see compound pages mapped with pte. For now we
won't collapse the pmd table with such pte.
khugepaged is subject for future rework wrt new refcounting.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With new refcounting THP can belong to several VMAs. This makes tricky
to track THP pages, when they partially mlocked. It can lead to leaking
mlocked pages to non-VM_LOCKED vmas and other problems.
With this patch we will split all pages on mlock and avoid
fault-in/collapse new THP in VM_LOCKED vmas.
I've tried alternative approach: do not mark THP pages mlocked and keep
them on normal LRUs. This way vmscan could try to split huge pages on
memory pressure and free up subpages which doesn't belong to VM_LOCKED
vmas. But this is user-visible change: we screw up Mlocked accouting
reported in meminfo, so I had to leave this approach aside.
We can bring something better later, but this should be good enough for
now.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With new refcounting we are going to see THP tail pages mapped with PTE.
Generic fast GUP rely on page_cache_get_speculative() to obtain
reference on page. page_cache_get_speculative() always fails on tail
pages, because ->_count on tail pages is always zero.
Let's handle tail pages in gup_pte_range().
New split_huge_page() will rely on migration entries to freeze page's
counts. Recheck PTE value after page_cache_get_speculative() on head
page should be enough to serialize against split.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to prepare kernel to allow transhuge pages to be mapped with
ptes too. We need to handle FOLL_SPLIT in follow_page_pte().
Also we use split_huge_page() directly instead of split_huge_page_pmd().
split_huge_page_pmd() will gone.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With new refcounting we will be able map the same compound page with
PTEs and PMDs. It requires adjustment to conditions when we can reuse
the page on write-protection fault.
For PTE fault we can't reuse the page if it's part of huge page.
For PMD we can only reuse the page if nobody else maps the huge page or
it's part. We can do it by checking page_mapcount() on each sub-page,
but it's expensive.
The cheaper way is to check page_count() to be equal 1: every mapcount
takes page reference, so this way we can guarantee, that the PMD is the
only mapping.
This approach can give false negative if somebody pinned the page, but
that doesn't affect correctness.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As with rmap, with new refcounting we cannot rely on PageTransHuge() to
check if we need to charge size of huge page form the cgroup. We need
to get information from caller to know whether it was mapped with PMD or
PTE.
We do uncharge when last reference on the page gone. At that point if
we see PageTransHuge() it means we need to unchange whole huge page.
The tricky part is partial unmap -- when we try to unmap part of huge
page. We don't do a special handing of this situation, meaning we don't
uncharge the part of huge page unless last user is gone or
split_huge_page() is triggered. In case of cgroup memory pressure
happens the partial unmapped page will be split through shrinker. This
should be good enough.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to allow mapping of individual 4k pages of THP compound
page. It means we cannot rely on PageTransHuge() check to decide if
map/unmap small page or THP.
The patch adds new argument to rmap functions to indicate whether we
want to operate on whole compound page or only the small page.
[n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't define meaning of page->mapping for tail pages. Currently it's
always NULL, which can be inconsistent with head page and potentially
lead to problems.
Let's poison the pointer to catch all illigal uses.
page_rmapping(), page_mapping() and page_anon_vma() are changed to look
on head page.
The only illegal use I've caught so far is __GPF_COMP pages from sound
subsystem, mapped with PTEs. do_shared_fault() is changed to use
page_rmapping() instead of direct access to fault_page->mapping.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As far as I can see there's no users of PG_reserved on compound pages.
Let's use PF_NO_COMPOUND here.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lock_page() must operate on the whole compound page. It doesn't make
much sense to lock part of compound page. Change code to use head
page's PG_locked, if tail page is passed.
This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked(). They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these
helper would trigger VM_BUG_ON().
SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never
appear there. VM_BUG_ON() is added to make sure that this assumption is
correct.
[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patch-bomb from Andrew Morton:
- A few hotfixes which missed 4.4 becasue I was asleep. cc'ed to
-stable
- A few misc fixes
- OCFS2 updates
- Part of MM. Including pretty large changes to page-flags handling
and to thp management which have been buffered up for 2-3 cycles now.
I have a lot of MM material this time.
[ It turns out the THP part wasn't quite ready, so that got dropped from
this series - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
zsmalloc: reorganize struct size_class to pack 4 bytes hole
mm/zbud.c: use list_last_entry() instead of list_tail_entry()
zram/zcomp: do not zero out zcomp private pages
zram: pass gfp from zcomp frontend to backend
zram: try vmalloc() after kmalloc()
zram/zcomp: use GFP_NOIO to allocate streams
mm: add tracepoint for scanning pages
drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
mm/page_isolation: use macro to judge the alignment
mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
mm: rework virtual memory accounting
include/linux/memblock.h: fix ordering of 'flags' argument in comments
mm: move lru_to_page to mm_inline.h
Documentation/filesystems: describe the shared memory usage/accounting
memory-hotplug: don't BUG() in register_memory_resource()
hugetlb: make mm and fs code explicitly non-modular
mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
mm: make sure isolate_lru_page() is never called for tail page
vmstat: make vmstat_updater deferrable again and shut down on idle
...
Reoder the pages_per_zspage field in struct size_class which can
eliminate the 4 bytes hole between it and stats field.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
list_last_entry*( has been defined in list.h, so replace
list_tail_entry() with it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fix from Al Viro:
"Don't put symlink bodies in pagecache into highmem"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Make sure that highmem pages are not added to symlink page cache
This patch series makes swapin readahead up to a certain number to gain
more thp performance and adds tracepoint for khugepaged_scan_pmd,
collapse_huge_page, __collapse_huge_page_isolate.
This patch series was written to deal with programs that access most,
but not all, of their memory after they get swapped out. Currently
these programs do not get their memory collapsed into THPs after the
system swapped their memory out, while they would get THPs before
swapping happened.
This patch series was tested with a test program, it allocates 400MB of
memory, writes to it, and then sleeps. I force the system to swap out
all. Afterwards, the test program touches the area by writing and
leaves a piece of it without writing. This shows how much swap in
readahead made by the patch.
Test results:
After swapped out
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 90076 kB | 88064 kB | 309928 kB | %99 |
-------------------------------------------------------------------
Without patch | 194068 kB | 192512 kB | 205936 kB | %99 |
-------------------------------------------------------------------
After swapped in
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 201408 kB | 198656 kB | 198596 kB | %98 |
-------------------------------------------------------------------
Without patch | 292624 kB | 192512 kB | 107380 kB | %65 |
-------------------------------------------------------------------
This patch (of 3):
Using static tracepoints, data of functions is recorded. It is good to
automatize debugging without doing a lot of changes in the source code.
This patch adds tracepoint for khugepaged_scan_pmd, collapse_huge_page
and __collapse_huge_page_isolate.
[dan.carpenter@oracle.com: add a missing tab]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
testing the RLIMIT_DATA value to figure out if we're allowed to assign
new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
commited that RLIMIT_DATA in a form it's implemented now doesn't do
anything useful because most of user-space libraries use mmap() syscall
for dynamic memory allocations.
Linus suggested to convert RLIMIT_DATA rlimit into something suitable
for anonymous memory accounting. But in this patch we go further, and
the changes are bundled together as:
* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas
This way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:
VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data (VmData)
The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be strange beast like readonly-private or VM_IO
area.
- RLIMIT_AS limits whole address space "VmSize"
- RLIMIT_STACK limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA now limits "VmData"
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Kees Cook <keescook@google.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Out of memory condition is not a bug and while we can't add new memory
in such case crashing the system seems wrong. Propagating the return
value from register_memory_resource() requires interface change.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Sheng Yong <shengyong1@huawei.com>
Cc: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Kconfig currently controlling compilation of this code is:
config HUGETLBFS
bool "HugeTLB file system support"
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that when
reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular case,
the init ordering gets moved to earlier levels when we use the more
appropriate initcalls here.
Originally I had the fs part and the mm part as separate commits, just
by happenstance of the nature of how I detected these non-modular use
cases. But that can possibly introduce regressions if the patch merge
ordering puts the fs part 1st -- as the 0-day testing reported a splat
at mount time.
Investigating with "initcall_debug" showed that the delta was
init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
both the fs change and the mm change are here together.
In addition, it worked before due to luck of link order, since they were
both in the same initcall category. So we now have the fs part using
fs_initcall, and the mm part using subsys_initcall, which puts it one
bucket earlier. It now passes the basic sanity test that failed in
earlier 0-day testing.
We delete the MODULE_LICENSE tag and capture that information at the top
of the file alongside author comments, etc.
We don't replace module.h with init.h since the file already has that.
Also note that MODULE_ALIAS is a no-op for non-modular code.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VM_BUG_ON_PAGE() would catch such cases if any still exists.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca ("vmstat: do not use deferrable delayed work for
vmstat_update"). This in turn can cause multiple interruptions of the
applications because the vmstat updater may run at
Make vmstate_update deferrable again and provide a function that folds
the differentials when the processor is going to idle mode thus
addressing the issue of the above commit in a clean way.
Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
Fixes: ba4877b9ca ("vmstat: do not use deferrable delayed work for vmstat_update")
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A CONFIG_MEMCG=y kernel booted with "cgroup_disable=memory" crashes on a
NULL memcg (but non-NULL root_mem_cgroup) when vmpressure kicks in.
Here's the patch I use to avoid that, but you might prefer a test on
mem_cgroup_disabled() somewhere.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to <linux/jump_label.h> the direct use of struct static_key is
deprecated. Update the socket and slab accounting code accordingly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let the networking stack know when a memcg is under reclaim pressure so
that it can clamp its transmit windows accordingly.
Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
in the socket and tcp memory code that tells it to curb consumption
growth from sockets associated with said control group.
Traditionally, vmpressure reports for the entire subtree of a memcg
under pressure, which drops useful information on the individual groups
reclaimed. However, it's too late to change the userinterface, so add a
second reporting mode that reports on the level of reclaim instead of at
the level of pressure, and use that report for sockets.
vmpressure events are naturally edge triggered, so for hysteresis assert
socket pressure for a second to allow for subsequent vmpressure events
to occur before letting the socket code return to normal.
This will likely need finetuning for a wider variety of workloads, but
for now stick to the vmpressure presets and keep hysteresis simple.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.
Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group. cgroup.memory=nosocket can be specified on the
boot commandline to override any runtime configuration and forcibly
exclude socket memory from active memory resource control.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unified hierarchy memory controller will account socket memory.
Move the infrastructure functions accordingly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unified hierarchy memory controller doesn't expose the memory+swap
counter to userspace, but its accounting is hardcoded in all charge
paths right now, including the per-cpu charge cache ("the stock").
To avoid adding yet more pointless memory+swap accounting with the
socket memory support in unified hierarchy, disable the counter
altogether when in unified hierarchy mode.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unified hierarchy memory controller is going to use this jump label
as well to control the networking callbacks. Move it to the memory
controller code and give it a more generic name.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and link
sockets directly to their owning memory cgroup.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.
Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.
Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable. However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either. However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchroneously when packets are processed). When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.
As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.
This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A later patch will need this symbol in files other than memcontrol.c, so
export it now and replace mem_cgroup_root_css at the same time.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
list_to_page() in readahead.c is the same as lru_to_page() in vmscan.c.
So I move lru_to_page to internal.h and drop list_to_page().
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch uses is_via_compact_memory() to distinguish compaction from
sysfs or sysctl. And, this patch also reduces indentation on
compaction_defer_reset() by filtering these cases first before checking
watermark.
There is no functional change.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We already have the for_each_memblock() macro in <linux/memblock.h>
which provides ability to iterate over memblock regions of a known type.
The for_each_memblock() macro allows us to pass the pointer to the
struct memblock_type, instead we need to pass name of the type.
This patch introduces a new macro for_each_memblock_type() which allows
us iterate over memblock regions with the given type when the type is
unknown.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove rgnbase and rgnsize variables from memblock_overlaps_region().
We use these variables only for passing to the memblock_addrs_overlap()
function and that's all. Let's remove them.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_NOFAIL is a big hammer used to ensure that the allocation request
can never fail. This is a strong requirement and as such it also
deserves a special treatment when the system is OOM. The primary
problem here is that the allocation request might have come with some
locks held and the oom victim might be blocked on the same locks. This
is basically an OOM deadlock situation.
This patch tries to reduce the risk of such a deadlocks by giving
__GFP_NOFAIL allocations a special treatment and let them dive into
memory reserves after oom killer invocation. This should help them to
make a progress and release resources they are holding. The OOM victim
should compensate for the reserves consumption.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry instead of list_for_each + list_entry to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To make the intention clearer, use list_{first,last}_entry instead of
list_entry.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0aaa29a56e ("mm, page_alloc: reserve pageblocks for high-order
atomic allocations on demand") added an unnecessary and unused parameter
to __rmqueue. It was a parameter that was used in an earlier version of
the patch and then left behind. This patch cleans it up.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirty balance reserve that dirty throttling has to consider is
merely memory not available to userspace allocations. There is nothing
writeback-specific about it. Generalize the name so that it's reusable
outside of that context.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_cache_read has been historically using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems are setting this mask to
GFP_NOFS to prevent from fs recursion issues. page_cache_read is called
from the vm_operations_struct::fault() context during the page fault.
This context doesn't need the reclaim protection normally.
ceph and ocfs2 which call filemap_fault from their fault handlers seem
to be OK because they are not taking any fs lock before invoking generic
implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the
reclaim recursion POV because this lock serializes truncate and punch
hole with the page faults and it doesn't get involved in the reclaim.
There is simply no reason to deliberately use a weaker allocation
context when a __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection
might be even harmful. There is a push to fail GFP_NOFS allocations
rather than loop within allocator indefinitely with a very limited
reclaim ability. Once we start failing those requests the OOM killer
might be triggered prematurely because the page cache allocation failure
is propagated up the page fault path and end up in
pagefault_out_of_memory.
We cannot play with mapping_gfp_mask directly because that would be racy
wrt. parallel page faults and it might interfere with other users who
really rely on NOFS semantic from the stored gfp_mask. The mask is also
inode proper so it would even be a layering violation. What we can do
instead is to push the gfp_mask into struct vm_fault and allow fs layer
to overwrite it should the callback need to be called with a different
allocation context.
Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
because this should be safe from the page fault path normally. Why do
we care about mapping_gfp_mask at all then? Because this doesn't hold
only reclaim protection flags but it also might contain zone and
movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
to respect those.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysctl_compaction_handler() is the handler function for compact_memory
tunable knob under /proc/sys/vm, add the missing knob name to make this
more accurate in comment.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack. This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.
The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems. The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy. This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.
The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:
http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device. The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece). With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.
The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally a 3-4 bits less than the number of bits in the
user-space accessible virtual address space.
When decided whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.
This patch (of 4):
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following flag comparison in mmap_region makes no sense:
if (!(vm_flags & MAP_FIXED))
return -ENOMEM;
The condition is always false and thus the above "return -ENOMEM" is
never executed. The vm_flags must not be compared with MAP_FIXED flag.
The vm_flags may only be compared with VM_* flags. MAP_FIXED has the
same value as VM_MAYREAD.
Hitting the rlimit is a slow path and find_vma_intersection should
realize that there is no overlapping VMA for !MAP_FIXED case pretty
quickly.
Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zone_reclaimable_pages counts how many pages are reclaimable in the
given zone. This currently includes all pages on file lrus and anon
lrus if there is an available swap storage. We do not consider
NR_ISOLATED_{ANON,FILE} counters though which is not correct because
these counters reflect temporarily isolated pages which are still
reclaimable because they either get back to their LRU or get freed
either by the page reclaim or page migration.
The number of these pages might be sufficiently high to confuse users of
zone_reclaimable_pages (e.g. mbind can migrate large ranges of memory
at once).
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.
This functionality looks dubious, because it is not clear what a use
case would be. By enabling tcp accounting a user accepts the price. If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled. It does not seem
there is any need to flip it while the workload is running.
Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled. Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.
Since this API peculiarity is not documented anywhere, I propose to drop
it. This will allow to simplify the code by dropping cg_proto->flags.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We assume there is enough inactive page cache if the size of inactive
file lru is greater than the size of active file lru, in which case we
force-scan file lru ignoring anonymous pages. While this logic works
fine when there are plenty of page cache pages, it fails if the size of
file lru is small (several MB): in this case (lru_size >> prio) will be
0 for normal scan priorities, as a result, if inactive file lru happens
to be larger than active file lru, anonymous pages of a cgroup will
never get evicted unless the system experiences severe memory pressure,
even if there are gigabytes of unused anonymous memory there, which is
unfair in respect to other cgroups, whose workloads might be page cache
oriented.
This patch attempts to fix this by elaborating the "enough inactive page
cache" check: it makes it not only check that inactive lru size > active
lru size, but also that we will scan something from the cgroup at the
current scan priority. If these conditions do not hold, we proceed to
SCAN_FRACT as usual.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.
The internal accounting currently counts shmem pages together with
regular files. As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES. The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before. The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".
[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function. We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop. This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().
To demonstrate the diference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the prvious
patch) and doesn't populate it at all. I time how long does it take to
cat /proc/pid/smaps of this process 100 times.
Before this patch:
real 0m3.831s
user 0m0.180s
sys 0m3.212s
After this patch:
real 0m1.176s
user 0m0.180s
sys 0m0.684s
The time is similar to the case where a radix tree iterator is employed
on the whole mapping.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overal complexity
is O(n*log(n)).
We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics tha shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.
This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible. Only for writable private
mappings of shmem objects (i.e. tmpfs files) with the shmem object
itself (partially) swapped outwe have to resort to the find_get_entry()
approach.
Hopefully such mappings are relatively uncommon.
To demonstrate the diference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
time how long does it take to cat /proc/pid/smaps of this process 100
times.
Private writable mapping of a /dev/shm/file (the most complex case):
real 0m3.831s
user 0m0.180s
sys 0m3.212s
Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator).
real 0m1.351s
user 0m0.096s
sys 0m0.768s
Same, but with /dev/shm/file not swapped (so no radix tree walk needed)
real 0m0.935s
user 0m0.128s
sys 0m0.344s
Private anonymous mapping:
real 0m0.949s
user 0m0.116s
sys 0m0.348s
The cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make memmap_valid_within return bool due to this particular function
only using either one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To make the intention clearer, use list_{next,first}_entry instead of
list_entry.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
__GFP_NOFAIL is requested. This is fragile because we are basically
relying on somebody else to make the reclaim (be it the direct reclaim
or OOM killer) for us. The caller might be holding resources (e.g.
locks) which block other other reclaimers from making any progress for
example. Remove the retry loop and rely on __alloc_pages_slowpath to
invoke all allowed reclaim steps and retry logic.
We have to be careful about __GFP_NOFAIL allocations from the
PF_MEMALLOC context even though this is a very bad idea to begin with
because no progress can be gurateed at all. We shouldn't break the
__GFP_NOFAIL semantic here though. It could be argued that this is
essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
is much harder to check for existing users because they might happen
deep down the code path performed much later after setting the flag so
we cannot really rule out there is no kernel path triggering this
combination.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_high_priority doesn't do anything special other than it
calls get_page_from_freelist and loops around GFP_NOFAIL allocation
until it succeeds. It would be better if the first part was done in
__alloc_pages_slowpath where we modify the zonelist because this would
be easier to read and understand. Opencoding the function into its only
caller allows to simplify it a bit as well.
This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hardcoding index to zonelists array in gfp_zonelist() is not a good
idea, let's enumerate it to improve readability.
No functional change.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
[n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make memblock_is_memory() and memblock_is_reserved return bool to
improve readability due to these particular functions only using either
one or zero as their return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move node_id zone_idx shrink flags into trace function, so thay we don't
need caculate these args if the trace is disabled, and will make this
function have less arguments.
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, we have tracepoint in test_pages_isolated() to notify pfn which
cannot be isolated. But, in alloc_contig_range(), some error path
doesn't call test_pages_isolated() so it's still hard to know exact pfn
that causes allocation failure.
This patch change this situation by calling test_pages_isolated() in
almost error path. In allocation failure case, some overhead is added
by this change, but, allocation failure is really rare event so it would
not matter.
In fatal signal pending case, we don't call test_pages_isolated()
because this failure is intentional one.
There was a bogus outer_start problem due to unchecked buddy order and
this patch also fix it. Before this patch, it didn't matter, because
end result is same thing. But, after this patch, tracepoint will report
failed pfn so it should be accurate.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cma allocation should be guranteeded to succeed. But sometimes it can
fail in the current implementation. To track down the problem, we need
to know which page is problematic and this new tracepoint will report
it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is preparation step to report test failed pfn in new tracepoint to
analyze cma allocation failure problem. There is no functional change
in this patch.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running the SPECint_rate gcc on some very large boxes it was
noticed that the system was spending lots of time in
mpol_shared_policy_lookup(). The gamess benchmark can also show it and
is what I mostly used to chase down the issue since the setup for that I
found to be easier.
To be clear the binaries were on tmpfs because of disk I/O requirements.
We then used text replication to avoid icache misses and having all the
copies banging on the memory where the instruction code resides. This
results in us hitting a bottleneck in mpol_shared_policy_lookup() since
lookup is serialised by the shared_policy lock.
I have only reproduced this on very large (3k+ cores) boxes. The
problem starts showing up at just a few hundred ranks getting worse
until it threatens to livelock once it gets large enough. For example
on the gamess benchmark at 128 ranks this area consumes only ~1% of
time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
90%.
To alleviate the contention in this area I converted the spinlock to an
rwlock. This allows a large number of lookups to happen simultaneously.
The results were quite good reducing this consumtion at max ranks to
around 2%.
[akpm@linux-foundation.org: tidy up code comments]
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move trace_reclaim_flags() into trace function, so that we don't need
caculate these flags if the trace is disabled.
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before usage page pointer initialized by NULL is reinitialized by
follow_page_mask(). Drop useless init of page pointer in the beginning
of loop.
Signed-off-by: Alexey Klimov <klimov.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make vmalloc family functions allocate vmalloc area pages with
alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
to memcg. This is needed, at least, to account alloc_fdmem allocations.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, if we want to account all objects of a particular kmem cache,
we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
to kmem_cache_create will force accounting for every allocation from
this cache even if __GFP_ACCOUNT is not passed.
This patch does not make any of the existing caches use this flag - it
will be done later in the series.
Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
hence cannot have different sets of SLAB_* flags. Thus using this flag
will probably reduce the number of merged slabs even if kmem accounting
is not used (only compiled in).
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.
So this patch switches kmem accounting to the white-policy: now only
those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
memcg. Currently, no kmem allocations are marked like this. The
following patches will mark several kmem allocations that are known to
be easily triggered from userspace and therefore should be accounted to
memcg.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 8f4fc071b1 ("gfp: add __GFP_NOACCOUNT").
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.
So it was decided to switch to the white-list policy. This patch
reverts bits introducing the black-list policy. The white-list policy
will be introduced later in the series.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new helper function get_first_slab() that get the first slab from
a kmem_cache_node.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify the code with list_for_each_entry().
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify the code with list_first_entry_or_null().
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inode_nohighmem() is sufficient to make sure that page_get_link()
won't try to allocate a highmem page. Moreover, it is sufficient
to make sure that page_symlink/__page_symlink won't do the same
thing. However, any filesystem that manually preseeds the symlink's
page cache upon symlink(2) needs to make sure that the page it
inserts there won't be a highmem one.
Fortunately, only nfs and shmem have run afoul of that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cgroup updates from Tejun Heo:
- cgroup v2 interface is now official. It's no longer hidden behind a
devel flag and can be mounted using the new cgroup2 fs type.
Unfortunately, cpu v2 interface hasn't made it yet due to the
discussion around in-process hierarchical resource distribution and
only memory and io controllers can be used on the v2 interface at the
moment.
- The existing documentation which has always been a bit of mess is
relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
is added as the authoritative documentation for the v2 interface.
- Some features are added through for-4.5-ancestor-test branch to
enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
netfilter changes will be merged through the net tree which pulled in
the said branch.
- Various cleanups
* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rename cgroup documentations
cgroup: fix a typo.
cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
cgroup: demote subsystem init messages to KERN_DEBUG
cgroup: Fix uninitialized variable warning
cgroup: put controller Kconfig options in meaningful order
cgroup: clean up the kernel configuration menu nomenclature
cgroup_pids: fix a typo.
Subject: cgroup: Fix incomplete dd command in blkio documentation
cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
cpuset: Replace all instances of time_t with time64_t
cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
- Support for a separate IRQ stack, although we haven't reduced the size
of our thread stack just yet since we don't have enough data to
determine a safe value
- Refactoring of our EFI initialisation and runtime code into
drivers/firmware/efi/ so that it can be reused by arch/arm/.
- Ftrace improvements when unwinding in the function graph tracer
- Document our silicon errata handling process
- Cache flushing optimisation when mapping executable pages
- Support for hugetlb mappings using the contiguous hint in the pte
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJWj+pFAAoJELescNyEwWM0/V8IALu8i2d6LijVICyZ/MH6pK+F
krbkIjdKFmIoFqo8HolCDMDqWfdzCLW671iYmks1DYVqM0Q5SXRa1rIzMw1Nbd3s
PzHS8qvnJFGtjXgwX5yxcyA5nU5hG5/mHJ8tbEg4zlQXvGONU6rZOlt4xY3ocZR7
iWmqoNX8LbPv5UgpifQ06QXEiC+4pm/BgADl2995oZfOaZ37L6c0oh6VcxQWyEf8
7OFRYtwruNyX2S5zJkL41Rh8gFAL9/j7lrHt2D+cxHR58X+qiRYKTjxkwJUt6i3E
ROZROsdQpyHojIIIYZEfNCZWjV0NwSghQfCnbsDwxVkkVeY414UXIno8JV4MyCk=
=JHvb
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"Here is the core arm64 queue for 4.5. As you might expect, the
Christmas break resulted in a number of patches not making the final
cut, so 4.6 is likely to be larger than usual. There's still some
useful stuff here, however, and it's detailed below.
The EFI changes have been Reviewed-by Matt and the memblock change got
an "OK" from akpm.
Summary:
- Support for a separate IRQ stack, although we haven't reduced the
size of our thread stack just yet since we don't have enough data
to determine a safe value
- Refactoring of our EFI initialisation and runtime code into
drivers/firmware/efi/ so that it can be reused by arch/arm/.
- Ftrace improvements when unwinding in the function graph tracer
- Document our silicon errata handling process
- Cache flushing optimisation when mapping executable pages
- Support for hugetlb mappings using the contiguous hint in the pte"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (45 commits)
arm64: head.S: use memset to clear BSS
efi: stub: define DISABLE_BRANCH_PROFILING for all architectures
arm64: entry: remove pointless SPSR mode check
arm64: mm: move pgd_cache initialisation to pgtable_cache_init
arm64: module: avoid undefined shift behavior in reloc_data()
arm64: module: fix relocation of movz instruction with negative immediate
arm64: traps: address fallout from printk -> pr_* conversion
arm64: ftrace: fix a stack tracer's output under function graph tracer
arm64: pass a task parameter to unwind_frame()
arm64: ftrace: modify a stack frame in a safe way
arm64: remove irq_count and do_softirq_own_stack()
arm64: hugetlb: add support for PTE contiguous bit
arm64: Use PoU cache instr for I/D coherency
arm64: Defer dcache flush in __cpu_copy_user_page
arm64: reduce stack use in irq_handler
arm64: mm: ensure that the zero page is visible to the page table walker
arm64: Documentation: add list of software workarounds for errata
arm64: mm: place __cpu_setup in .text
arm64: cmpxchg: Don't incldue linux/mmdebug.h
arm64: mm: fold alternatives into .init
...
The x86 vvar vma contains pages with differing cacheability
flags. x86 currently implements this by manually inserting all
the ptes using (io_)remap_pfn_range when the vma is set up.
x86 wants to move to using .fault with VM_FAULT_NOPAGE to set up
the mappings as needed. The correct API to use to insert a pfn
in .fault is vm_insert_pfn(), but vm_insert_pfn() can't override the
vma's cache mode, and the HPET page in particular needs to be
uncached despite the fact that the rest of the VMA is cached.
Add vm_insert_pfn_prot() to support varying cacheability within
the same non-COW VMA in a more sane manner.
x86 could alternatively use multiple VMAs, but that's messy,
would break CRIU, and would create unnecessary VMAs that would
waste memory.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d2938d1eb37be7a5e4f86182db646551f11e45aa.1451446564.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Requiring special mappings to give a list of struct pages is
inflexible: it prevents sane use of IO memory in a special
mapping, it's inefficient (it requires arch code to initialize a
list of struct pages, and it requires the mm core to walk the
entire list just to figure out how long it is), and it prevents
arch code from doing anything fancy when a special mapping fault
occurs.
Add a .fault method as an alternative to filling in a .pages
array.
Looks-OK-to: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a26d1677c0bc7e774c33f469451a78ca31e9e6af.1451446564.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- make the debugfs 'kernel_page_tables' file read-only, as it only
has read ops. (Borislav Petkov)
- micro-optimize clflush_cache_range() (Chris Wilson)
- swiotlb enhancements, which fixes certain KVM emulated devices
(Igor Mammedov)
- fix an LDT related debug message (Jan Beulich)
- modularize CONFIG_X86_PTDUMP (Kees Cook)
- tone down an overly alarming warning (Laura Abbott)
- Mark variable __initdata (Rasmus Villemoes)
- PAT additions (Toshi Kani)"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Micro-optimise clflush_cache_range()
x86/mm/pat: Change free_memtype() to support shrinking case
x86/mm/pat: Add untrack_pfn_moved for mremap
x86/mm: Drop WARN from multi-BAR check
x86/LDT: Print the real LDT base address
x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN
x86/mm: Introduce max_possible_pfn
x86/mm/ptdump: Make (debugfs)/kernel_page_tables read-only
x86/mm/mtrr: Mark the 'range_new' static variable in mtrr_calc_range_state() as __initdata
x86/mm: Turn CONFIG_X86_PTDUMP into a module
Pull vfs xattr updates from Al Viro:
"Andreas' xattr cleanup series.
It's a followup to his xattr work that went in last cycle; -0.5KLoC"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr handlers: Simplify list operation
ocfs2: Replace list xattr handler operations
nfs: Move call to security_inode_listsecurity into nfs_listxattr
xfs: Change how listxattr generates synthetic attributes
tmpfs: listxattr should include POSIX ACL xattrs
tmpfs: Use xattr handler infrastructure
btrfs: Use xattr handler infrastructure
vfs: Distinguish between full xattr names and proper prefixes
posix acls: Remove duplicate xattr name definitions
gfs2: Remove gfs2_xattr_acl_chmod
vfs: Remove vfs_xattr_cmp
Pull vfs RCU symlink updates from Al Viro:
"Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
even if the symlink is not an embedded one.
No changes since the mailbomb on Jan 1"
* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->get_link() to delayed_call, kill ->put_link()
kill free_page_put_link()
teach nfs_get_link() to work in RCU mode
teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
teach shmem_get_link() to work in RCU mode
teach page_get_link() to work in RCU mode
replace ->follow_link() with new method that could stay in RCU mode
don't put symlink bodies in pagecache into highmem
namei: page_getlink() and page_follow_link_light() are the same thing
ufs: get rid of ->setattr() for symlinks
udf: don't duplicate page_symlink_inode_operations
logfs: don't duplicate page_symlink_inode_operations
switch befs long symlinks to page_symlink_operations
kernel test robot has reported the following crash:
BUG: unable to handle kernel NULL pointer dereference at 00000100
IP: [<c1074df6>] __queue_work+0x26/0x390
*pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
Workqueue: events vmstat_shepherd
task: cb684600 ti: cb7ba000 task.ti: cb7ba000
EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
EIP is at __queue_work+0x26/0x390
EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
Stack:
Call Trace:
__queue_delayed_work+0xa1/0x160
queue_delayed_work_on+0x36/0x60
vmstat_shepherd+0xad/0xf0
process_one_work+0x1aa/0x4c0
worker_thread+0x41/0x440
kthread+0xb0/0xd0
ret_from_kernel_thread+0x21/0x40
The reason is that start_shepherd_timer schedules the shepherd work item
which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
that workqueue so if the further initialization takes more than HZ we
might end up scheduling on a NULL vmstat_wq. This is really unlikely
but not impossible.
Fixes: 373ccbe592 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mremap() with MREMAP_FIXED on a VM_PFNMAP range causes the following
WARN_ON_ONCE() message in untrack_pfn().
WARNING: CPU: 1 PID: 3493 at arch/x86/mm/pat.c:985 untrack_pfn+0xbd/0xd0()
Call Trace:
[<ffffffff817729ea>] dump_stack+0x45/0x57
[<ffffffff8109e4b6>] warn_slowpath_common+0x86/0xc0
[<ffffffff8109e5ea>] warn_slowpath_null+0x1a/0x20
[<ffffffff8106a88d>] untrack_pfn+0xbd/0xd0
[<ffffffff811d2d5e>] unmap_single_vma+0x80e/0x860
[<ffffffff811d3725>] unmap_vmas+0x55/0xb0
[<ffffffff811d916c>] unmap_region+0xac/0x120
[<ffffffff811db86a>] do_munmap+0x28a/0x460
[<ffffffff811dec33>] move_vma+0x1b3/0x2e0
[<ffffffff811df113>] SyS_mremap+0x3b3/0x510
[<ffffffff817793ee>] entry_SYSCALL_64_fastpath+0x12/0x71
MREMAP_FIXED moves a pfnmap from old vma to new vma. untrack_pfn() is
called with the old vma after its pfnmap page table has been removed,
which causes follow_phys() to fail. The new vma has a new pfnmap to
the same pfn & cache type with VM_PAT set. Therefore, we only need to
clear VM_PAT from the old vma in this case.
Add untrack_pfn_moved(), which clears VM_PAT from a given old vma.
move_vma() is changed to call this function with the old vma when
VM_PFNMAP is set. move_vma() then calls do_munmap(), and untrack_pfn()
is a no-op since VM_PAT is cleared.
Reported-by: Stas Sergeev <stsp@list.ru>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1450832064-10093-2-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Similar to memdup_user(), except that allocated buffer is one byte
longer and '\0' is stored after the copied data.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
mod_zone_page_state() takes a "delta" integer argument. delta contains
the number of pages that should be added or subtracted from a struct
zone's vm_stat field.
If a zone is larger than 8TB this will cause overflows. E.g. for a
zone with a size slightly larger than 8TB the line
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
in mm/page_alloc.c:free_area_init_core() will result in a negative
result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
contain 0x8xxxxxxx pages which will be sign extended to a negative
value.
Fix this by changing the delta argument to long type.
This could fix an early boot problem seen on s390, where we have a 9TB
system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
completely empty, so it tries to reclaim pages in an endless loop.
This was seen on a heavily patched 3.10 kernel. One possible
explaination seem to be the overflows caused by mod_zone_page_state().
Unfortunately I did not have the chance to verify that this patch
actually fixes the problem, since I don't have access to the system
right now. However the overflow problem does exist anyway.
Given the description that a system with slightly less than 8TB does
work, this seems to be a candidate for the observed problem.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range. pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.
Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.
This also prevents a crash from offlining memory devices with missing
sections. Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.
Signed-off-by: Andrew Banman <abanman@sgi.com>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup. If the
target cgroup does not get scanned again (we might have just reclaimed
the last page or all processes might exit and free their memory
voluntary), we will leak it, because there is nobody to put the
reference held by the iterator.
The problem is easy to reproduce by running the following command
sequence in a loop:
mkdir /sys/fs/cgroup/memory/test
echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
memhog 150M
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
rmdir test
The cgroups generated by it will never get freed.
This patch fixes this issue by making mem_cgroup_iter avoid taking
reference to the current position. In order not to hit use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of ->css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it. This callback
is called right before scheduling rcu work which will free css, so if we
access iter->position from rcu read section, we might be sure it won't
go away under us.
[hannes@cmpxchg.org: clean up css ref handling]
Fixes: 5ac8fb31ad ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration
from subtree_control enabling") introduced the following compiler warning:
mm/memcontrol.c: In function ‘mem_cgroup_can_attach’:
mm/memcontrol.c:4790:9: warning: ‘memcg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
mc.to = memcg;
^
Fix this by initializing 'memcg' to NULL.
This was found using gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6).
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change the use of strncmp in zswap_pool_find_get() to strcmp.
The use of strncmp is no longer correct, now that zswap_zpool_type is
not an array; sizeof() will return the size of a pointer, which isn't
the right length to compare. We don't need to use strncmp anyway,
because the existing params and the passed in params are all guaranteed
to be null terminated, so strcmp should be used.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's possible that an oom killed victim shares an ->mm with the init
process and thus oom_kill_process() would end up trying to kill init as
well.
This has been shown in practice:
Out of memory: Kill process 9134 (init) score 3 or sacrifice child
Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
Kill process 1 (init) sharing same memory
...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
And this will result in a kernel panic.
If a process is forked by init and selected for oom kill while still
sharing init_mm, then it's likely this system is in a recoverable state.
However, it's better not to try to kill init and allow the machine to
panic due to unkillable processes.
[rientjes@google.com: rewrote changelog]
[akpm@linux-foundation.org: fix inverted test, per Ben]
Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov provides a little program, autogenerated by syzkaller,
which races a fault on a mapping of a sparse memfd object, against
truncation of that object below the fault address: run repeatedly for a
few minutes, it reliably generates shmem_evict_inode()'s
WARN_ON(inode->i_blocks).
(But there's nothing specific to memfd here, nor to the fstat which it
happened to use to generate the fault: though that looked suspicious,
since a shmem_recalc_inode() had been added there recently. The same
problem can be reproduced with open+unlink in place of memfd_create, and
with fstatfs in place of fstat.)
v3.7 commit 0f3c42f522 ("tmpfs: change final i_blocks BUG to WARNING")
explains one cause of such a warning (a race with shmem_writepage to
swap), and possible solutions; but we never took it further, and this
syzkaller incident turns out to have a different cause.
shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
then found to be beyond eof, looks plausible - decrementing the alloced
count that was just before incremented - but in fact can go wrong, if a
racing thread (the truncator, for example) gets its shmem_recalc_inode()
in just after our delete_from_page_cache(). delete_from_page_cache()
decrements nrpages, that shmem_recalc_inode() will balance the books by
decrementing alloced itself, then our decrement of alloced take it one
too low: leading to the WARNING when the object is finally evicted.
Once the new page has been exposed in the page cache,
shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
the accounting right in all cases (and not fall through from "trunc:" to
"decused:"). Adjust that error recovery block; and the reinitialization
of info and sbinfo can be removed too.
While we're here, fix shmem_writepage() to avoid the original issue: it
will be safe against a racing shmem_recalc_inode(), if it merely
increments swapped before the shmem_delete_from_page_cache() which
decrements nrpages (but it must then do its own shmem_recalc_inode()
before that, while still in balance, instead of after). (Aside: why do
we shmem_recalc_inode() here in the swap path? Because its raison d'etre
is to cope with clean sparse shmem pages being reclaimed behind our
back: so here when swapping is a good place to look for that case.) But
I've not now managed to reproduce this bug, even without the patch.
I don't see why I didn't do that earlier: perhaps inhibited by the
preference to eliminate shmem_recalc_inode() altogether. Driven by this
incident, I do now have a patch to do so at last; but still want to sit
on it for a bit, there's a couple of questions yet to be resolved.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov reported the following memory leak
unreferenced object 0xffff88002eaafd88 (size 32):
comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
hex dump (first 32 bytes):
28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
kmalloc include/linux/slab.h:458
region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
__vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
vma_needs_reservation mm/hugetlb.c:1813
alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
hugetlb_no_page mm/hugetlb.c:3543
hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
__get_user_pages+0x542/0xf30 mm/gup.c:497
populate_vma_page_range+0xde/0x110 mm/gup.c:919
__mm_populate+0x1c7/0x310 mm/gup.c:969
do_mlock+0x291/0x360 mm/mlock.c:637
SYSC_mlock2 mm/mlock.c:658
SyS_mlock2+0x4b/0x70 mm/mlock.c:648
Dmitry identified a potential memory leak in the routine region_chg,
where a region descriptor is not free'ed on an error path.
However, the root cause for the above memory leak resides in region_del.
In this specific case, a "placeholder" entry is created in region_chg.
The associated page allocation fails, and the placeholder entry is left
in the reserve map. This is "by design" as the entry should be deleted
when the map is released. The bug is in the region_del routine which is
used to delete entries within a specific range (and when the map is
released). region_del did not handle the case where a placeholder entry
exactly matched the start of the range range to be deleted. In this
case, the entry would not be deleted and leaked. The fix is to take
these special placeholder entries into account in region_del.
The region_chg error path leak is also fixed.
Fixes: feba16e25a ("mm/hugetlb: add region_del() to delete a specific range of entries")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
and check whether the obtained *ptep is a migration/hwpoison entry or
not. And if not, then we get to call huge_pte_alloc(). This is racy
because the *ptep could turn into migration/hwpoison entry after the
huge_pte_offset() check. This race results in BUG_ON in
huge_pte_alloc().
We don't have to call huge_pte_alloc() when the huge_pte_offset()
returns non-NULL, so let's fix this bug with moving the code into else
block.
Note that the *ptep could turn into a migration/hwpoison entry after
this block, but that's not a problem because we have another
!pte_present check later (we never go into hugetlb_no_page() in that
case.)
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Whoops, I missed removing the kerneldoc comment of the lrucare arg
removed from mem_cgroup_replace_page; but it's a good comment, keep it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa has reported that the system might basically livelock in
OOM condition without triggering the OOM killer.
The issue is caused by internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable) which are performed from
the workqueue context. If all the current workers get assigned to an
allocation request, though, they will be looping inside the allocator
trying to reclaim memory but zone_reclaimable can see stalled numbers so
it will consider a zone reclaimable even though it has been scanned way
too much. WQ concurrency logic will not consider this situation as a
congested workqueue because it relies that worker would have to sleep in
such a situation. This also means that it doesn't try to spawn new
workers or invoke the rescuer thread if the one is assigned to the
queue.
In order to fix this issue we need to do two things. First we have to
let wq concurrency code know that we are in trouble so we have to do a
short sleep. In order to prevent from issues handled by 0e093d9976
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads which are
the ones of the interest anyway.
The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 016c13daa5 ("mm, page_alloc: use masks and shifts when
converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
wasn't updated to reflect that.
As a result, the file /proc/pagetypeinfo shows the counts for Movable as
Reclaimable and vice versa.
Additionally, commit 0aaa29a56e ("mm, page_alloc: reserve pageblocks
for high-order atomic allocations on demand") introduced
MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
show_migration_types(), so it doesn't appear in the listing of free
areas during page alloc failures or oom kills.
This patch fixes both problems. The atomic reserves will show with a
letter 'H' in the free areas listings.
Fixes: 016c13daa5 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
Fixes: 0aaa29a56e ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the memory.high threshold is exceeded, try_charge() schedules a
task_work to reclaim the excess. The reclaim target is set to the
number of pages requested by try_charge().
This is wrong, because try_charge() usually charges more pages than
requested (batch > nr_pages) in order to refill per cpu stocks. As a
result, a process in a cgroup can easily exceed memory.high
significantly when doing a lot of charges w/o returning to userspace
(e.g. reading a file in big chunks).
Fix this issue by assuring that when exceeding memory.high a process
reclaims as many pages as were actually charged (i.e. batch).
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
alloc_buddy_huge_page() to directly create a hugepage from the buddy
allocator.
In that case, however, if alloc_buddy_huge_page() succeeds we don't
decrement h->resv_huge_pages, which means that successful
hugetlb_fault() returns without releasing the reserve count. As a
result, subsequent hugetlb_fault() might fail despite that there are
still free hugepages.
This patch simply adds decrementing code on that code path.
I reproduced this problem when testing v4.3 kernel in the following situation:
- the test machine/VM is a NUMA system,
- hugepage overcommiting is enabled,
- most of hugepages are allocated and there's only one free hugepage
which is on node 0 (for example),
- another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
node 1, tries to allocate a hugepage,
- the allocation should fail but the reserve count is still hold.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This introduces the MEMBLOCK_NOMAP attribute and the required plumbing
to make it usable as an indicator that some parts of normal memory
should not be covered by the kernel direct mapping. It is up to the
arch to actually honor the attribute when laying out this mapping,
but the memblock code itself is modified to disregard these regions
for allocations and other general use.
Cc: linux-mm@kvack.org
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cgroup fixes from Tejun Heo:
"More change than I'd have liked at this stage. The pids controller
and the changes made to cgroup core to support it introduced and
revealed several important issues.
- Assigning membership to a newly created task and migrating it can
race leading to incorrect accounting. Oleg fixed it by widening
threadgroup synchronization. It looks like we'll be able to merge
it with a different percpu rwsem which is used in fork path making
things simpler and cheaper.
- The recent change to extend cgroup membership to zombies (so that
pid accounting can extend till the pid is actually released) missed
pinning the underlying data structures leading to use-after-free.
Fixed.
- v2 hierarchy was calling subsystem callbacks with the wrong target
cgroup_subsys_state based on the incorrect assumption that they
share the same target. pids is the first controller affected by
this. Subsys callbacks updated so that they can deal with
multi-target migrations"
* 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup_pids: don't account for the root cgroup
cgroup: fix handling of multi-destination migration from subtree_control enabling
cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
cgroup: make css_set pin its css's to avoid use-afer-free
cgroup: fix cftype->file_offset handling
The following commit which went into mainline through networking tree
3b13758f51 ("cgroups: Allow dynamically changing net_classid")
conflicts in net/core/netclassid_cgroup.c with the following pending
fix in cgroup/for-4.4-fixes.
1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")
The former separates out update_classid() from cgrp_attach() and
updates it to walk all fds of all tasks in the target css so that it
can be used from both migration and config change paths. The latter
drops @css from cgrp_attach().
Resolve the conflict by making cgrp_attach() call update_classid()
with the css from the first task. We can revive @tset walking in
cgrp_attach() but given that net_cls is v1 only where there always is
only one target css during migration, this is fine.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Nina Schiff <ninasc@fb.com>
When a file on tmpfs has an ACL or a Default ACL, listxattr should include the
corresponding xattr name.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use the VFS xattr handler infrastructure and get rid of similar code in
the filesystem. For implementing shmem_xattr_handler_set, we need a
version of simple_xattr_set which removes the attribute when value is
NULL. Use this to implement kernfs_iop_removexattr as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
max_possible_pfn will be used for tracking max possible
PFN for memory that isn't present in E820 table and
could be hotplugged later.
By default max_possible_pfn is initialized with max_pfn,
but later it could be updated with highest PFN of
hotpluggable memory ranges declared in ACPI SRAT table
if any present.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: fujita.tomonori@lab.ntt.co.jp
Cc: konrad.wilk@oracle.com
Cc: pbonzini@redhat.com
Cc: revers@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge slub bulk allocator updates from Andrew Morton:
"This missed the merge window because I was waiting for some repairs to
come in. Nothing actually uses the bulk allocator yet and the changes
to other code paths are pretty small. And the net guys are waiting
for this so they can start merging the client code"
More comments from Jesper Dangaard Brouer:
"The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in
previous kernel. The present version contains a bug. Vladimir
Davydov noticed it contained a bug, when kernel is compiled with
CONFIG_MEMCG_KMEM (see commit 03ec0ed57ffc: "slub: fix kmem cgroup
bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in
kmem_cache_free_bulk() were missing (see commit 033745189b "slub:
add missing kmem cgroup support to kmem_cache_free_bulk").
I don't consider the fix stable-material because there are no in-tree
users of the API.
But with known bugs (for memcg) I cannot start using the API in the
net-tree"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
slab/slub: adjust kmem_cache_alloc_bulk API
slub: add missing kmem cgroup support to kmem_cache_free_bulk
slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
slub: optimize bulk slowpath free by detached freelist
slub: support for bulk free with SLUB freelists
Adjust kmem_cache_alloc_bulk API before we have any real users.
Adjust API to return type 'int' instead of previously type 'bool'. This
is done to allow future extension of the bulk alloc API.
A future extension could be to allow SLUB to stop at a page boundary, when
specified by a flag, and then return the number of objects.
The advantage of this approach, would make it easier to make bulk alloc
run without local IRQs disabled. With an approach of cmpxchg "stealing"
the entire c->freelist or page->freelist. To avoid overshooting we would
stop processing at a slab-page boundary. Else we always end up returning
some objects at the cost of another cmpxchg.
To keep compatible with future users of this API linking against an older
kernel when using the new flag, we need to return the number of allocated
objects with this API change.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Initial implementation missed support for kmem cgroup support in
kmem_cache_free_bulk() call, add this.
If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
to not add any asm code.
Incoming bulk free objects can belong to different kmem cgroups, and
object free call can happen at a later point outside memcg context. Thus,
we need to keep the orig kmem_cache, to correctly verify if a memcg object
match against its "root_cache" (s->memcg_params.root_cache).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The call slab_pre_alloc_hook() interacts with kmemgc and is not allowed to
be called several times inside the bulk alloc for loop, due to the call to
memcg_kmem_get_cache().
This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.
As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
to handle an array of objects.
A subtle detail is, loop iterator "i" in slab_post_alloc_hook() must have
same type (size_t) as size argument. This helps the compiler to easier
realize that it can remove the loop, when all debug statements inside loop
evaluates to nothing. Note, this is only an issue because the kernel is
compiled with GCC option: -fno-strict-overflow
In slab_alloc_node() the compiler inlines and optimizes the invocation of
slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and access
object directly.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it possible to free a freelist with several objects by adjusting API
of slab_free() and __slab_free() to have head, tail and an objects counter
(cnt).
Tail being NULL indicate single object free of head object. This allow
compiler inline constant propagation in slab_free() and
slab_free_freelist_hook() to avoid adding any overhead in case of single
object free.
This allows a freelist with several objects (all within the same
slab-page) to be free'ed using a single locked cmpxchg_double in
__slab_free() and with an unlocked cmpxchg_double in slab_free().
Object debugging on the free path is also extended to handle these
freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
objects don't belong to the same slab-page.
These changes are needed for the next patch to bulk free the detached
freelists it introduces and constructs.
Micro benchmarking showed no performance reduction due to this change,
when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"A bunch of fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
slub: avoid irqoff/on in bulk allocation
slub: create new ___slab_alloc function that can be called with irqs disabled
mm: fix up sparse warning in gfpflags_allow_blocking
ocfs2: fix umask ignored issue
PM/OPP: add entry in MAINTAINERS
kernel/panic.c: turn off locks debug before releasing console lock
kernel/signal.c: unexport sigsuspend()
kasan: fix kmemleak false-positive in kasan_module_alloc()
fat: fix fake_offset handling on error path
mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes
mm/page-writeback.c: initialize m_dirty to avoid compile warning
various: fix pci_set_dma_mask return value checking
mm: loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()
tools/vm/page-types.c: support KPF_IDLE
ncpfs: don't allow negative timeouts
configfs: allow dynamic group creation
MAINTAINERS: add Moritz as reviewer for FPGA Manager Framework
slab.h: sprinkle __assume_aligned attributes
The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated
#else. For readability mark it with a comment.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new function that can do allocation while interrupts are disabled.
Avoids irq on/off sequences.
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bulk alloc needs a function like that because it enables interrupts before
calling __slab_alloc which promptly disables them again using the expensive
local_irq_save().
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmemleak reports the following leak:
unreferenced object 0xfffffbfff41ea000 (size 20480):
comm "modprobe", pid 65199, jiffies 4298875551 (age 542.568s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff82354f5e>] kmemleak_alloc+0x4e/0xc0
[<ffffffff8152e718>] __vmalloc_node_range+0x4b8/0x740
[<ffffffff81574072>] kasan_module_alloc+0x72/0xc0
[<ffffffff810efe68>] module_alloc+0x78/0xb0
[<ffffffff812f6a24>] module_alloc_update_bounds+0x14/0x70
[<ffffffff812f8184>] layout_and_allocate+0x16f4/0x3c90
[<ffffffff812faa1f>] load_module+0x2ff/0x6690
[<ffffffff813010b6>] SyS_finit_module+0x136/0x170
[<ffffffff8239bbc9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
kasan_module_alloc() allocates shadow memory for module and frees it on
module unloading. It doesn't store the pointer to allocated shadow memory
because it could be calculated from the shadowed address, i.e.
kasan_mem_to_shadow(addr).
Since kmemleak cannot find pointer to allocated shadow, it thinks that
memory leaked.
Use kmemleak_ignore() to tell kmemleak that this is not a leak and shadow
memory doesn't contain any pointers.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building kernel with gcc 5.2, the below warning is raised:
mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
unsigned long m_dirty, m_thresh, m_bg_thresh;
The m_dirty{thresh, bg_thresh} are initialized in the block of "if
(mdtc)", so if mdts is null, they won't be initialized before being used.
Initialize m_dirty to zero, also initialize m_thresh and m_bg_thresh to
keep consistency.
They are used later by if condition: !mdtc || m_dirty <=
dirty_freerun_ceiling(m_thresh, m_bg_thresh)
If mdtc is null, dirty_freerun_ceiling will not be called at all, so the
initialization will not change any behavior other than just ceasing the
compile warning.
(akpm: the patch actually reduces .text size by ~20 bytes on gcc-4.x.y)
[akpm@linux-foundation.org: add comment]
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
hugepage but hugepage_madvise() takes the error path when we ask to turn
on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
new postcopy migration feature to fail on s390 because its first action is
to madvise the guest address space as NOHUGEPAGE. This patch modifies the
code so that the operation succeeds without error now.
For consistency reasons do the same for MADV_HUGEPAGE.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 71394fe501 ("mm: vmalloc: add flag preventing guard hole
allocation") missed a spot. Currently remove_vm_area() decreases vm->size
to "remove" the guard hole page, even when it isn't present. All but one
users just free the vm_struct rigth away and never access vm->size anyway.
Don't touch the size in remove_vm_area() and have __vunmap() use the
proper get_vm_area_size() helper.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAX handling of COW faults has wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write
Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].
Original COW locking logic was introduced by Matthew here[4].
This should be applied to v4.3 as well.
[1] 0f90cc6609 mm, dax: fix DAX deadlocks
[2] 52a2b53ffd mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978b dax: fix race between simultaneous faults
[4] 2e4cdab058 mm: allow page fault handlers to perform the COW
Cc: <stable@vger.kernel.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Merge final patch-bomb from Andrew Morton:
"Various leftovers, mainly Christoph's pci_dma_supported() removals"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
pci: remove pci_dma_supported
usbnet: remove ifdefed out call to dma_supported
kaweth: remove ifdefed out call to dma_supported
sfc: don't call dma_supported
nouveau: don't call pci_dma_supported
netup_unidvb: use pci_set_dma_mask insted of pci_dma_supported
cx23885: use pci_set_dma_mask insted of pci_dma_supported
cx25821: use pci_set_dma_mask insted of pci_dma_supported
cx88: use pci_set_dma_mask insted of pci_dma_supported
saa7134: use pci_set_dma_mask insted of pci_dma_supported
saa7164: use pci_set_dma_mask insted of pci_dma_supported
tw68-core: use pci_set_dma_mask insted of pci_dma_supported
pcnet32: use pci_set_dma_mask insted of pci_dma_supported
lib/string.c: add ULL suffix to the constant definition
hugetlb: trivial comment fix
selftests/mlock2: add ULL suffix to 64-bit constants
selftests/mlock2: add missing #define _GNU_SOURCE
In commit a1c34a3bf0 ("mm: Don't offset memmap for flatmem") Laura
fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM
IFC6410 board.
One small wrinkle on ia64 is that it allocates the node_mem_map earlier
in arch code, so it skips the block of code where "offset" is
initialized.
Move initialization of start and offset before the check for the
node_mem_map so that they will always be available in the latter part of
the function.
Tested-by: Laura Abbott <laura@labbott.name>
Fixes: a1c34a3bf0 (mm: Don't offset memmap for flatmem)
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
Pull trivial updates from Jiri Kosina:
"Trivial stuff from trivial tree that can be trivially summed up as:
- treewide drop of spurious unlikely() before IS_ERR() from Viresh
Kumar
- cosmetic fixes (that don't really affect basic functionality of the
driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek
- various comment / printk fixes and updates all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
bcache: Really show state of work pending bit
hwmon: applesmc: fix comment typos
Kconfig: remove comment about scsi_wait_scan module
class_find_device: fix reference to argument "match"
debugfs: document that debugfs_remove*() accepts NULL and error values
net: Drop unlikely before IS_ERR(_OR_NULL)
mm: Drop unlikely before IS_ERR(_OR_NULL)
fs: Drop unlikely before IS_ERR(_OR_NULL)
drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
UBI: Update comments to reflect UBI_METAONLY flag
pktcdvd: drop null test before destroy functions
Let's try to be consistent about data type of page order.
[sfr@canb.auug.org.au: fix build (type of pageblock_order)]
[hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:
CPU0 CPU1
isolate_migratepages_block()
page_count()
compound_head()
!!PageTail() == true
put_page()
tail->first_page = NULL
head = tail->first_page
alloc_pages(__GFP_COMP)
prep_compound_page()
tail->first_page = head
__SetPageTail(p);
!!PageTail() == true
<head == NULL dereferencing>
The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.
We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.
The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.
The patch moves page->pmd_huge_pte out of word, just in case if an
architecture defines pgtable_t into something what can have the bit 0
set.
hugetlb_cgroup uses page->lru.next in the second tail page to store
pointer struct hugetlb_cgroup. The patch switch it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.
The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in first tail page to store struct
hugetlb_cgroup pointer. But that's out of scope of the patch.
That means page->compound_head shares storage space with:
- page->lru.next;
- page->next;
- page->rcu_head.next;
That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word.
page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
call_rcu_lazy() is not allowed as it makes use of the bit and we can
get false positive PageTail().
[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch halves space occupied by compound_dtor and compound_order in
struct page.
For compound_order, it's trivial long -> short conversion.
For get_compound_page_dtor(), we now use hardcoded table for destructor
lookup and store its index in the struct page instead of direct pointer
to destructor. It shouldn't be a big trouble to maintain the table: we
have only two destructor and NULL currently.
This patch free up one word in tail pages for reuse. This is preparation
for the next patch.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to rework how compound_head() work. It will not use
page->first_page as we have it now.
The only other user of page->first_page beyond compound pages is
zsmalloc.
Let's use page->private instead of page->first_page here. It occupies
the same storage space.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each `struct size_class' contains `struct zs_size_stat': an array of
NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built with no
CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned
long)' per-class.
The patch removes unneeded `struct zs_size_stat' members by redefining
NR_ZS_STAT_TYPE (max stat idx in array).
Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants,
GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type
larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at
the moment.
./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39)
function old new delta
fix_fullness_group 97 94 -3
insert_zspage 100 86 -14
remove_zspage 141 119 -22
To summarize:
a) each class now uses less memory
b) we avoid a number of dec/inc stats (a minor optimization,
but still).
The gain will increase once we introduce additional stats.
A simple IO test.
iozone -t 4 -R -r 32K -s 60M -I +Z
patched base
" Initial write " 4145599.06 4127509.75
" Rewrite " 4146225.94 4223618.50
" Read " 17157606.00 17211329.50
" Re-read " 17380428.00 17267650.50
" Reverse Read " 16742768.00 16162732.75
" Stride read " 16586245.75 16073934.25
" Random read " 16349587.50 15799401.75
" Mixed workload " 10344230.62 9775551.50
" Random write " 4277700.62 4260019.69
" Pwrite " 4302049.12 4313703.88
" Pread " 6164463.16 6126536.72
" Fwrite " 7131195.00 6952586.00
" Fread " 12682602.25 12619207.50
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't let user to disable shrinker in zsmalloc (once it's been
enabled), so no need to check ->shrinker_enabled in zs_shrinker_count(),
at the moment at least.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A cosmetic change.
Commit c60369f011 ("staging: zsmalloc: prevent mappping in interrupt
context") added in_interrupt() check to zs_map_object() and 'hardirq.h'
include; but in_interrupt() macro is defined in 'preempt.h' not in
'hardirq.h', so include it instead.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In obj_malloc():
if (!class->huge)
/* record handle in the header of allocated chunk */
link->handle = handle;
else
/* record handle in first_page->private */
set_page_private(first_page, handle);
In the hugepage we save handle to private directly.
But in obj_to_head():
if (class->huge) {
VM_BUG_ON(!is_first_page(page));
return *(unsigned long *)page_private(page);
} else
return *(unsigned long *)obj;
It is used as a pointer.
The reason why there is no problem until now is huge-class page is born
with ZS_FULL so it can't be migrated. However, we need this patch for
future work: "VM-aware zsmalloced page migration" to reduce external
fragmentation.
Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the return type of zpool_get_type const; the string belongs to the
zpool driver and should not be modified. Remove the redundant type field
in the struct zpool; it is private to zpool.c and isn't needed since
->driver->type can be used directly. Add comments indicating strings must
be null-terminated.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of using a fixed-length string for the zswap params, use charp.
This simplifies the code and uses less memory, as most zswap param strings
will be less than the current maximum length.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the next line entry variable will be re-initialized so no need to init
it with NULL.
Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "vma" parameter to khugepaged_alloc_page() is unused. It has to
remain unused or the drop read lock 'map_sem' optimisation introduce by
commit 8b1645685a ("mm, THP: don't hold mmap_sem in khugepaged when
allocating THP") wouldn't be safe. So let's remove it.
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.
Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew stated the following
We have quite a history of remote parts of the kernel using
weird/wrong/inexplicable combinations of __GFP_ flags. I tend
to think that this is because we didn't adequately explain the
interface.
And I don't think that gfp.h really improved much in this area as
a result of this patchset. Could you go through it some time and
decide if we've adequately documented all this stuff?
This patches first moves some GFP flag combinations that are part of the MM
internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
bits under various headings and then documents the flag combinations. It
will not help callers that are brain damaged but the clarity might motivate
some fixes and avoid future mistakes.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The primary purpose of watermarks is to ensure that reclaim can always
make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
These assume that order-0 allocations are all that is necessary for
forward progress.
High-order watermarks serve a different purpose. Kswapd had no high-order
awareness before they were introduced
(https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was
particularly important when there were high-order atomic requests. The
watermarks both gave kswapd awareness and made a reserve for those atomic
requests.
There are two important side-effects of this. The most important is that
a non-atomic high-order request can fail even though free pages are
available and the order-0 watermarks are ok. The second is that
high-order watermark checks are expensive as the free list counts up to
the requested order must be examined.
With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
have high-order watermarks. Kswapd and compaction still need high-order
awareness which is handled by checking that at least one suitable
high-order page is free.
With the patch applied, there was little difference in the allocation
failure rates as the atomic reserves are small relative to the number of
allocation attempts. The expected impact is that there will never be an
allocation failure report that shows suitable pages on the free lists.
The one potential side-effect of this is that in a vanilla kernel, the
watermark checks may have kept a free page for an atomic allocation. Now,
we are 100% relying on the HighAtomic reserves and an early allocation to
have allocated them. If the first high-order atomic allocation is after
the system is already heavily fragmented then it'll fail.
[akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
High-order watermark checking exists for two reasons -- kswapd high-order
awareness and protection for high-order atomic requests. Historically the
kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
high-order free pages for as long as possible. This patch introduces
MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
allocations on demand and avoids using those blocks for order-0
allocations. This is more flexible and reliable than MIGRATE_RESERVE was.
A MIGRATE_HIGHORDER pageblock is created when an atomic high-order
allocation request steals a pageblock but limits the total number to 1% of
the zone. Callers that speculatively abuse atomic allocations for
long-lived high-order allocations to access the reserve will quickly fail.
Note that SLUB is currently not such an abuser as it reclaims at least
once. It is possible that the pageblock stolen has few suitable
high-order pages and will need to steal again in the near future but there
would need to be strong justification to search all pageblocks for an
ideal candidate.
The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.
The watermark checks account for the reserved pageblocks when the
allocation request is not a high-order atomic allocation.
The reserved pageblocks can not be used for order-0 allocations. This may
allow temporary wastage until a failed reclaim reassigns the pageblock.
This is deliberate as the intent of the reservation is to satisfy a
limited number of atomic high-order short-lived requests if the system
requires them.
The stutter benchmark was used to evaluate this but while it was running
there was a systemtap script that randomly allocated between 1 high-order
page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
is much larger than the potential reserve and it does not attempt to be
realistic. It is intended to stress random high-order allocations from an
unknown source, show that there is a reduction in failures without
introducing an anomaly where atomic allocations are more reliable than
regular allocations. The amount of memory reserved varied throughout the
workload as reserves were created and reclaimed under memory pressure.
The allocation failures once the workload warmed up were as follows;
4.2-rc5-vanilla 70%
4.2-rc5-atomic-reserve 56%
The failure rate was also measured while building multiple kernels. The
failure rate was 14% but is 6% with this patch applied.
Overall, this is a small reduction but the reserves are small relative to
the number of allocation requests. In early versions of the patch, the
failure rate reduced by a much larger amount but that required much larger
reserves and perversely made atomic allocations seem more reliable than
regular allocations.
[yalin.wang2010@gmail.com: fix redundant check and a memory leak]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MIGRATE_RESERVE preserves an old property of the buddy allocator that
existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
tended to remain contiguous until the only alternative was to fail the
allocation. At the time it was discovered that high-order atomic
allocations relied on this property so MIGRATE_RESERVE was introduced. A
later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
Note that this patch in isolation may look like a false regression if
someone was bisecting high-order atomic allocation failures.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.
1) The cpuset checks are no-ops unless a cpuset is active and in general
are a lot cheaper.
2) zone_reclaim is now disabled by default and I suspect that was a large
source of the cost that zlc wanted to avoid. When it is enabled, it's
known to be a major source of stalling when nodes fill up and it's
unwise to hit every other user with the overhead.
3) Watermark checks are expensive to calculate for high-order
allocation requests. Later patches in this series will reduce the cost
of the watermark checking.
4) The most important issue is that in the current implementation it
is possible for a failed THP allocation to mark a zone full for order-0
allocations and cause a fallback to remote nodes.
The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it. If stalls due
to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. Most page-reclaim related
workloads showed no noticeable difference as a result of the removal.
The impact was noticeable in a workload called "stutter". One part uses a
lot of anonymous memory, a second measures mmap latency and a third copies
a large file. In an ideal world the latency application would not notice
the mmap latency. On a 2-node machine the results of this patch are
stutter
4.3.0-rc1 4.3.0-rc1
baseline nozlc-v4
Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%)
1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%)
2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%)
3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%)
Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%)
Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%)
Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%)
Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%)
Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%)
Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%)
Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%)
Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%)
Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%)
Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%)
Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%)
Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%)
Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%)
Note that the maximum stall latency went from 24 seconds to 12 which is
still bad but an improvement. The milage varies considerably 2-node
machine on an earlier test went from 494 seconds to 47 seconds and a
4-node machine that tested an earlier version of this patch went from a
worst case stall time of 6 seconds to 67ms. The nature of the benchmark
is inherently unpredictable as it is hammering the system and the milage
will vary between machines.
There is a secondary impact with potentially more direct reclaim because
zones are now being considered instead of being skipped by zlc. In this
particular test run it did not occur so will not be described. However,
in at least one test the following was observed
1. Direct reclaim rates were higher. This was likely due to direct reclaim
being entered instead of the zlc disabling a zone and busy looping.
Busy looping may have the effect of allowing kswapd to make more
progress and in some cases may be better overall. If this is found then
the correct action is to put direct reclaimers to sleep on a waitqueue
and allow kswapd make forward progress. Busy looping on the zlc is even
worse than when the allocator used to blindly call congestion_wait().
2. There was higher swap activity as direct reclaim was active.
3. Direct reclaim efficiency was lower. This is related to 1 as more
scanning activity also encountered more pages that could not be
immediately reclaimed
In that case, the direct page scan and reclaim rates are noticeable but
it is not considered a problem for a few reasons
1. The test is primarily concerned with latency. The mmap attempts are also
faulted which means there are THP allocation requests. The ZLC could
cause zones to be disabled causing the process to busy loop instead
of reclaiming. This looks like elevated direct reclaim activity but
it's the correct action to take based on what processes requested.
2. The test hammers reclaim and compaction heavily. The number of successful
THP faults is highly variable but affects the reclaim stats. It's not a
realistic or reasonable measure of page reclaim activity.
3. No other page-reclaim intensive workload that was tested showed a problem.
4. If a workload is identified that benefitted from the busy looping then it
should be fixed by having direct reclaimers sleep on a wait queue until
woken by kswapd instead of busy looping. We had this class of problem before
when congestion_waits() with a fixed timeout was a brain damaged decision
but happened to benefit some workloads.
If a workload is identified that relied on the zlc to busy loop then it
should be fixed correctly and have a direct reclaimer sleep on a waitqueue
until woken by kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep. Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep. The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>