Commit Graph

12444 Commits

Author SHA1 Message Date
Xidong Wang 1ec6995d12 z3fold: fix memory leak
In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
released on the error path that pool->compact_wq , which holds the
return value of create_singlethread_workqueue(), is NULL.  This will
result in a memory leak bug.

[akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
Signed-off-by: Xidong Wang <wangxidong_97@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Michal Hocko 2a70f6a76b memcg, thp: do not invoke oom killer on thp charges
A THP memcg charge can trigger the oom killer since 2516035499 ("mm,
thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
We have used an explicit __GFP_NORETRY previously which ruled the OOM
killer automagically.

Memcg charge path should be semantically compliant with the allocation
path and that means that if we do not trigger the OOM killer for costly
orders which should do the same in the memcg charge path as well.
Otherwise we are forcing callers to distinguish the two and use
different gfp masks which is both non-intuitive and bug prone.  As soon
as we get a costly high order kmalloc user we even do not have any means
to tell the memcg specific gfp mask to prevent from OOM because the
charging is deep within guts of the slab allocator.

The unexpected memcg OOM on THP has already been fixed upstream by
9d3c3354bb ("mm, thp: do not cause memcg oom for thp") but this is a
one-off fix rather than a generic solution.  Teach mem_cgroup_oom to
bail out on costly order requests to fix the THP issue as well as any
other costly OOM eligible allocations to be added in future.

Also revert 9d3c3354bb because special gfp for THP is no longer
needed.

Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
Fixes: 2516035499 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Ralph Campbell 07707125ae mm/migrate: properly preserve write attribute in special migrate entry
Use of pte_write(pte) is only valid for present pte, the common code
which set the migration entry can be reach for both valid present pte
and special swap entry (for device memory).  Fix the code to use the
mpfn value which properly handle both cases.

On x86 this did not have any bad side effect because pte write bit is
below PAGE_BIT_GLOBAL and thus special swap entry have it set to 0 which
in turn means we were always creating read only special migration entry.

So once migration did finish we always write protected the CPU page
table entry (moreover this is only an issue when migrating from device
memory to system memory).  End effect is that CPU write access would
fault again and restore write permission.

This behaviour isn't too bad; it just burns CPU cycles by forcing CPU to
take a second fault on write access. ie, double faulting the same
address.  There is no corruption or incorrect states (it behaves as a
COWed page from a fork with a mapcount of 1).

Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Mel Gorman 09a913a7a9 sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages
change_pte_range is called from task work context to mark PTEs for
receiving NUMA faulting hints.  If the marked pages are dirty then
migration may fail.  Some filesystems cannot migrate dirty pages without
blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
Even when they can, it can be a waste of cycles when the pages are
shared forcing higher scan rates.  This patch avoids marking shared
dirty pages for hinting faults but also will skip a migration if the
page was dirtied after the scanner updated a clean page.

This is most noticeable running the NASA Parallel Benchmark when backed
by btrfs, the default root filesystem for some distributions, but also
noticeable when using XFS.

The following are results from a 4-socket machine running a 4.16-rc4
kernel with some scheduler patches that are pending for the next merge
window.

                        4.16.0-rc4             4.16.0-rc4
                 schedtip-20180309          nodirty-v1
  Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
  Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
  Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
  Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
  Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)

is.D regresses slightly in terms of absolute time but note that that
particular load varies quite a bit from run to run.  The more relevant
observation is the total system CPU usage.

            4.16.0-rc4  4.16.0-rc4
          schedtip-20180309 nodirty-v1
  User        71471.91    70627.04
  System      11078.96     8256.13
  Elapsed       661.66      632.74

That is a substantial drop in system CPU usage and overall the workload
completes faster.  The NUMA balancing statistics are also interesting

  NUMA base PTE updates        111407972   139848884
  NUMA huge PMD updates           206506      264869
  NUMA page range updates      217139044   275461812
  NUMA hint faults               4300924     3719784
  NUMA hint local faults         3012539     3416618
  NUMA hint local percent             70          91
  NUMA pages migrated            1517487     1358420

While more PTEs are scanned due to changes in what faults are gathered,
it's clear that a far higher percentage of faults are local as the bulk
of the remote hits were dirty pages that, in this case with btrfs, had
no chance of migrating.

The following is a comparison when using XFS as that is a more realistic
filesystem choice for a data partition

                        4.16.0-rc4             4.16.0-rc4
                 schedtip-20180309          nodirty-v1r47
  Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
  Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
  Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
  Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
  Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)

That is a reasonable gain on two relatively long-lived workloads.  While
not presented, there is also a substantial drop in system CPu usage and
the NUMA balancing stats show similar improvements in locality as btrfs
did.

Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Tejun Heo 18be460eeb mm/hmm.c: remove superfluous RCU protection around radix tree lookup
hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
actually uses the RCU protection.  The only caller is
hmm_devmem_pages_create() which already grabs the mutex and does
superfluous rcu_read_lock/unlock() around the function.

This doesn't add anything and just adds to confusion.  Remove the RCU
protection and open-code the radix tree lookup.  If this needs to become
more sophisticated in the future, let's add them back when necessary.

Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse f88a1e90c6 mm/hmm: use device driver encoding for HMM pfn
Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
pfn shift value allowing them to define their own encoding for HMM pfn
that are fill inside the pfns array of the hmm_range struct.  With this
device driver can get pfn that match their own private encoding out of HMM
without having to do any conversion.

[rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
  Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
[rcampbell@nvidia.com: clarify fault logic for device private memory]
  Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse 2aee09d8c1 mm/hmm: change hmm_vma_fault() to allow write fault on page basis
This changes hmm_vma_fault() to not take a global write fault flag for a
range but instead rely on caller to populate HMM pfns array with proper
fault flag ie HMM_PFN_VALID if driver want read fault for that address or
HMM_PFN_VALID and HMM_PFN_WRITE for write.

Moreover by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
device private memory to be migrated back to system memory through page
fault.

This is more flexible API and it better reflects how device handles and
reports fault.

Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse 53f5c3f489 mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
No functional change, just create one function to handle pmd and one to
handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).

Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse 33cd47dcbb mm/hmm: move hmm_pfns_clear() closer to where it is used
Move hmm_pfns_clear() closer to where it is used to make it clear it is
not use by page table walkers.

Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse b2744118a6 mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
Make naming consistent across code, DEVICE_PRIVATE is the name use outside
HMM code so use that one.

Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 5504ed2969 mm/hmm: do not differentiate between empty entry or missing directory
There is no point in differentiating between a range for which there is
not even a directory (and thus entries) and empty entry (pte_none() or
pmd_none() returns true).

Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.

Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 855ce7d252 mm/hmm: cleanup special vma handling (VM_SPECIAL)
Special vma (one with any of the VM_SPECIAL flags) can not be access by
device because there is no consistent model across device drivers on those
vma and their backing memory.

This patch directly use hmm_range struct for hmm_pfns_special() argument
as it is always affecting the whole vma and thus the whole range.

It also make behavior consistent after this patch both hmm_vma_fault() and
hmm_vma_get_pfns() returns -EINVAL when facing such vma.  Previously
hmm_vma_fault() returned 0 and hmm_vma_get_pfns() return -EINVAL but both
were filling the HMM pfn array with special entry.

Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse ff05c0c6bb mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong
All device driver we care about are using 64bits page table entry.  In
order to match this and to avoid useless define convert all HMM pfn to
directly use uint64_t.  It is a first step on the road to allow driver to
directly use pfn value return by HMM (saving memory and CPU cycles use for
conversion between the two).

Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 86586a41b8 mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture
Only peculiar architecture allow write without read thus assume that any
valid pfn do allow for read.  Note we do not care for write only because
it does make sense with thing like atomic compare and exchange or any
other operations that allow you to get the memory value through them.

Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse 08232a4544 mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range struct
as parameter and were initializing that struct with others of their
parameters.  Have caller of those function do this as they are likely to
already do and only pass this struct to both function this shorten
function signature and make it easier in the future to add new parameters
by simply adding them to the structure.

Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse c719547f03 mm/hmm: hmm_pfns_bad() was accessing wrong struct
The private field of mm_walk struct point to an hmm_vma_walk struct and
not to the hmm_range struct desired.  Fix to get proper struct pointer.

Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Jérôme Glisse c01cbba2aa mm/hmm: unregister mmu_notifier when last HMM client quit
This code was lost in translation at one point.  This properly call
mmu_notifier_unregister_no_release() once last user is gone.  This fix the
zombie mm_struct as without this patch we do not drop the refcount we have
on it.

Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Ralph Campbell e1401513c6 mm/hmm: HMM should have a callback before MM is destroyed
hmm_mirror_register() registers a callback for when the CPU pagetable is
modified.  Normally, the device driver will call hmm_mirror_unregister()
when the process using the device is finished.  However, if the process
exits uncleanly, the struct_mm can be destroyed with no warning to the
device driver.

Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Steven Rostedt d51d1e6450 mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure.  This
structure is currently local to mm/vmscan.c.  By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event.  In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:

 0f 83 aa fe ff ff       jae    ffffffff811e6261 <shrink_inactive_list+0x1e1>
 48 8b 45 a0             mov    -0x60(%rbp),%rax
 45 8b 64 24 20          mov    0x20(%r12),%r12d
 44 8b 6d d4             mov    -0x2c(%rbp),%r13d
 8b 4d d0                mov    -0x30(%rbp),%ecx
 44 8b 75 cc             mov    -0x34(%rbp),%r14d
 44 8b 7d c8             mov    -0x38(%rbp),%r15d
 48 89 45 90             mov    %rax,-0x70(%rbp)
 8b 83 b8 fe ff ff       mov    -0x148(%rbx),%eax
 8b 55 c0                mov    -0x40(%rbp),%edx
 8b 7d c4                mov    -0x3c(%rbp),%edi
 8b 75 b8                mov    -0x48(%rbp),%esi
 89 45 80                mov    %eax,-0x80(%rbp)
 65 ff 05 e4 f7 e2 7e    incl   %gs:0x7ee2f7e4(%rip)        # 15bd0 <__preempt_count>
 48 8b 05 75 5b 13 01    mov    0x1135b75(%rip),%rax        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
 48 85 c0                test   %rax,%rax
 74 72                   je     ffffffff811e646a <shrink_inactive_list+0x3ea>
 48 89 c3                mov    %rax,%rbx
 4c 8b 10                mov    (%rax),%r10
 89 f8                   mov    %edi,%eax
 48 89 85 68 ff ff ff    mov    %rax,-0x98(%rbp)
 89 f0                   mov    %esi,%eax
 48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
 89 c8                   mov    %ecx,%eax
 48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
 89 d0                   mov    %edx,%eax
 48 89 85 70 ff ff ff    mov    %rax,-0x90(%rbp)
 8b 45 8c                mov    -0x74(%rbp),%eax
 48 8b 7b 08             mov    0x8(%rbx),%rdi
 48 83 c3 18             add    $0x18,%rbx
 50                      push   %rax
 41 54                   push   %r12
 41 55                   push   %r13
 ff b5 78 ff ff ff       pushq  -0x88(%rbp)
 41 56                   push   %r14
 41 57                   push   %r15
 ff b5 70 ff ff ff       pushq  -0x90(%rbp)
 4c 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%r9
 4c 8b 85 60 ff ff ff    mov    -0xa0(%rbp),%r8
 48 8b 4d 98             mov    -0x68(%rbp),%rcx
 48 8b 55 90             mov    -0x70(%rbp),%rdx
 8b 75 80                mov    -0x80(%rbp),%esi
 41 ff d2                callq  *%r10

After the patch:

 0f 83 a8 fe ff ff       jae    ffffffff811e626d <shrink_inactive_list+0x1cd>
 8b 9b b8 fe ff ff       mov    -0x148(%rbx),%ebx
 45 8b 64 24 20          mov    0x20(%r12),%r12d
 4c 8b 6d a0             mov    -0x60(%rbp),%r13
 65 ff 05 f5 f7 e2 7e    incl   %gs:0x7ee2f7f5(%rip)        # 15bd0 <__preempt_count>
 4c 8b 35 86 5b 13 01    mov    0x1135b86(%rip),%r14        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
 4d 85 f6                test   %r14,%r14
 74 2a                   je     ffffffff811e6411 <shrink_inactive_list+0x371>
 49 8b 06                mov    (%r14),%rax
 8b 4d 8c                mov    -0x74(%rbp),%ecx
 49 8b 7e 08             mov    0x8(%r14),%rdi
 49 83 c6 18             add    $0x18,%r14
 4c 89 ea                mov    %r13,%rdx
 45 89 e1                mov    %r12d,%r9d
 4c 8d 45 b8             lea    -0x48(%rbp),%r8
 89 de                   mov    %ebx,%esi
 51                      push   %rcx
 48 8b 4d 98             mov    -0x68(%rbp),%rcx
 ff d0                   callq  *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Andrey Ryabinin e3c1ac586c mm/vmscan: don't mess with pgdat->flags in memcg reclaim
memcg reclaim may alter pgdat->flags based on the state of LRU lists in
cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
pages.  But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested().  Note that only kswapd
have powers to clear any of these bits.  This might just never happen if
cgroup limits configured that way.  So all direct reclaims will stall as
long as we have some congested bdi in the system.

Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
it's reasonable to leave all decisions about node state to kswapd.

Why only kswapd? Why not allow to global direct reclaim change these
flags? It is because currently only kswapd can clear these flags.  I'm
less worried about the case when PGDAT_CONGESTED falsely not set, and
more worried about the case when it falsely set.  If direct reclaimer
sets PGDAT_CONGESTED, do we have guarantee that after the congestion
problem is sorted out, kswapd will be woken up and clear the flag? It
seems like there is no such guarantee.  E.g.  direct reclaimers may
eventually balance pgdat and kswapd simply won't wake up (see
wakeup_kswapd()).

Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
now loses its congestion throttling mechanism.  Add per-cgroup
congestion state and throttle cgroup2 reclaimers if memcg is in
congestion state.

Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.

The problem could be easily demonstrated by creating heavy congestion in
one cgroup:

    echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir -p /sys/fs/cgroup/congester
    echo 512M > /sys/fs/cgroup/congester/memory.max
    echo $$ > /sys/fs/cgroup/congester/cgroup.procs
    /* generate a lot of diry data on slow HDD */
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
    ....
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

and some job in another cgroup:

    mkdir /sys/fs/cgroup/victim
    echo 128M > /sys/fs/cgroup/victim/memory.max

    # time cat /dev/sda > /dev/null
    real    10m15.054s
    user    0m0.487s
    sys     1m8.505s

According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.

With the patch, cat don't waste time anymore:

    # time cat /dev/sda > /dev/null
    real    5m32.911s
    user    0m0.411s
    sys     0m56.664s

[aryabinin@virtuozzo.com: congestion state should be per-node]
  Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
[ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup[
  Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Andrey Ryabinin d108c7721f mm/vmscan: don't change pgdat state on base of a single LRU list state
We have separate LRU list for each memory cgroup.  Memory reclaim
iterates over cgroups and calls shrink_inactive_list() every inactive
LRU list.  Based on the state of a single LRU shrink_inactive_list() may
flag the whole node as dirty,congested or under writeback.  This is
obviously wrong and hurtful.  It's especially hurtful when we have
possibly small congested cgroup in system.  Than *all* direct reclaims
waste time by sleeping in wait_iff_congested().  And the more memcgs in
the system we have the longer memory allocation stall is, because
wait_iff_congested() called on each lru-list scan.

Sum reclaim stats across all visited LRUs on node and flag node as
dirty, congested or under writeback based on that sum.  Also call
congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.

This only fixes the problem for global reclaim case.  Per-cgroup reclaim
may alter global pgdat flags too, which is wrong.  But that is separate
issue and will be addressed in the next patch.

This change will not have any effect on a systems with all workload
concentrated in a single cgroup.

[aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
  Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:30 -07:00
Andrey Ryabinin c4fd4fa580 mm/vmscan: remove redundant current_may_throttle() check
Only kswapd can have non-zero nr_immediate, and current_may_throttle()
is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
enough to check stat.nr_immediate only.

Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Andrey Ryabinin 894befec4d mm/vmscan: update stale comments
Update some comments that became stale since transiton from per-zone to
per-node reclaim.

Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Roman Gushchin d79f7aa496 mm: treat indirectly reclaimable memory as free in overcommit logic
Indirectly reclaimable memory can consume a significant part of total
memory and it's actually reclaimable (it will be released under actual
memory pressure).

So, the overcommit logic should treat it as free.

Otherwise, it's possible to cause random system-wide memory allocation
failures by consuming a significant amount of memory by indirectly
reclaimable memory, e.g.  dentry external names.

If overcommit policy GUESS is used, it might be used for denial of
service attack under some conditions.

The following program illustrates the approach.  It causes the kernel to
allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
that at some point the overcommit logic may start blocking large
allocation system-wide.

  int main()
  {
  	char buf[256];
  	unsigned long i;
  	struct stat statbuf;

  	buf[0] = '/';
  	for (i = 1; i < sizeof(buf); i++)
  		buf[i] = '_';

  	for (i = 0; 1; i++) {
  		sprintf(&buf[248], "%8lu", i);
  		stat(buf, &statbuf);
  	}

  	return 0;
  }

This patch in combination with related indirectly reclaimable memory
patches closes this issue.

Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Roman Gushchin 034ebf65c3 mm: treat indirectly reclaimable memory as available in MemAvailable
Adjust /proc/meminfo MemAvailable calculation by adding the amount of
indirectly reclaimable memory (rounded to the PAGE_SIZE).

Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Roman Gushchin eb59254608 mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES
Patch series "indirectly reclaimable memory", v2.

This patchset introduces the concept of indirectly reclaimable memory
and applies it to fix the issue of when a big number of dentries with
external names can significantly affect the MemAvailable value.

This patch (of 3):

Introduce a concept of indirectly reclaimable memory and adds the
corresponding memory counter and /proc/vmstat item.

Indirectly reclaimable memory is any sort of memory, used by the kernel
(except of reclaimable slabs), which is actually reclaimable, i.e.  will
be released under memory pressure.

The counter is in bytes, as it's not always possible to count such
objects in pages.  The name contains BYTES by analogy to
NR_KERNEL_STACK_KB.

Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:29 -07:00
Linus Torvalds 49a695ba72 powerpc updates for 4.17
Notable changes:
 
  - Support for 4PB user address space on 64-bit, opt-in via mmap().
 
  - Removal of POWER4 support, which was accidentally broken in 2016 and no one
    noticed, and blocked use of some modern instructions.
 
  - Workarounds so that the hypervisor can enable Transactional Memory on Power9.
 
  - A series to disable the DAWR (Data Address Watchpoint Register) on Power9.
 
  - More information displayed in the meltdown/spectre_v1/v2 sysfs files.
 
  - A vpermxor (Power8 Altivec) implementation for the raid6 Q Syndrome.
 
  - A big series to make the allocation of our pacas (per cpu area), kernel page
    tables, and per-cpu stacks NUMA aware when using the Radix MMU on Power9.
 
 And as usual many fixes, reworks and cleanups.
 
 Thanks to:
   Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy, Alistair Popple, Andy
   Shevchenko, Aneesh Kumar K.V, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Christophe Leroy, Christophe Lombard, Cyril Bur, Daniel Axtens,
   Dave Young, Finn Thain, Frederic Barrat, Gustavo Romero, Horia Geantă,
   Jonathan Neuschäfer, Kees Cook, Larry Finger, Laurent Dufour, Laurent Vivier,
   Logan Gunthorpe, Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus
   Elfring, Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de
   Oliveira, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
   Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher Boessenkool,
   Simon Guo, Simon Horman, Stewart Smith, Sukadev Bhattiprolu, Suraj Jitindar
   Singh, Thiago Jung Bauermann, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant
   Hegde, Wei Yongjun.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJayKxDExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAr
 JQ/6A9Xs4zHDn9OeT9esEIxciETqUlrP0Wp64c4JVC7EkG1E7xRDZ4Xb4m8R2nNt
 9sPhtNO1yCtEk6kFQtPNB0N8v6pud4I6+aMcYnn+tP8mJRYQ4x9bYaF3Hw98IKmE
 Kd6TglmsUQvh2GpwPiF93KpzzWu1HB2kZzzqJcAMTMh7C79Qz00BjrTJltzXB2jx
 tJ+B4lVy8BeU8G5nDAzJEEwb5Ypkn8O40rS/lpAwVTYOBJ8Rbyq8Fj82FeREK9YO
 4EGaEKPkC/FdzX7OJV3v2/nldCd8pzV471fAoGuBUhJiJBMBoBybcTHIdDex7LlL
 zMLV1mUtGo8iolRPhL8iCH+GGifZz2WzstYCozz7hgIraWtc/frq9rZp6q0LdH/K
 trk7UbPGlVb92ecWZVpZyEcsMzKrCgZqnAe9wRNh1uEKScEdzd/bmRaMhENUObRh
 Hili6AVvmSKExpy7k2sZP/oUMaeC15/xz8Lk7l8a/iCkYhNmPYh5iSXM5+UKpcRT
 FYOcO0o3DwXsN46Whow3nJ7TqAsDy9/ecPUG71JQi3ZrHnRrm8jxkn8MCG5pZ1Fi
 KvKDxlg6RiJo3DF9/fSOpJUokvMwqBS5dJo4eh5eiDy94aBTqmBKFecvPxQm7a0L
 l3uXCF/6JuXEvMukFjGBO4RiYhw8i+B2uKsh81XUh7HKrgE=
 =HAB1
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Support for 4PB user address space on 64-bit, opt-in via mmap().

   - Removal of POWER4 support, which was accidentally broken in 2016
     and no one noticed, and blocked use of some modern instructions.

   - Workarounds so that the hypervisor can enable Transactional Memory
     on Power9.

   - A series to disable the DAWR (Data Address Watchpoint Register) on
     Power9.

   - More information displayed in the meltdown/spectre_v1/v2 sysfs
     files.

   - A vpermxor (Power8 Altivec) implementation for the raid6 Q
     Syndrome.

   - A big series to make the allocation of our pacas (per cpu area),
     kernel page tables, and per-cpu stacks NUMA aware when using the
     Radix MMU on Power9.

  And as usual many fixes, reworks and cleanups.

  Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy,
  Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
  Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic
  Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook,
  Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe,
  Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring,
  Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira,
  Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
  Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher
  Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev
  Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav
  Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun"

* tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits)
  powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep
  powerpc/64s: Fix POWER9 DD2.2 and above in cputable features
  powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
  powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits
  Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead"
  powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo}
  powerpc: io.h: move iomap.h include so that it can use readq/writeq defs
  cxl: Fix possible deadlock when processing page faults from cxllib
  powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
  powerpc/mm/radix: Update command line parsing for disable_radix
  powerpc/mm/radix: Parse disable_radix commandline correctly.
  powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
  powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
  powerpc/mm/keys: Update documentation and remove unnecessary check
  powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead
  powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop()
  powerpc/powernv: Always stop secondaries before reboot/shutdown
  powerpc: hard disable irqs in smp_send_stop loop
  powerpc: use NMI IPI for smp_send_stop
  powerpc/powernv: Fix SMT4 forcing idle code
  ...
2018-04-07 12:08:19 -07:00
Linus Torvalds 3b54765cca Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - the v9fs maintainers have been missing for a long time. I've taken
   over v9fs patch slinging.

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
  mm,oom_reaper: check for MMF_OOM_SKIP before complaining
  mm/ksm: fix interaction with THP
  mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
  headers: untangle kmemleak.h from mm.h
  include/linux/mmdebug.h: make VM_WARN* non-rvals
  mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
  mm: change return type to vm_fault_t
  mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
  mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
  kernel/fork.c: detect early free of a live mm
  mm: make counting of list_lru_one::nr_items lockless
  mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
  block_invalidatepage(): only release page if the full page was invalidated
  mm: kernel-doc: add missing parameter descriptions
  mm/swap.c: remove @cold parameter description for release_pages()
  mm/nommu: remove description of alloc_vm_area
  zram: drop max_zpage_size and use zs_huge_class_size()
  zsmalloc: introduce zs_huge_class_size()
  mm: fix races between swapoff and flush dcache
  fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
  ...
2018-04-06 14:19:26 -07:00
Tetsuo Handa 97b1255cb2 mm,oom_reaper: check for MMF_OOM_SKIP before complaining
I got "oom_reaper: unable to reap pid:" messages when the victim thread
was blocked inside free_pgtables() (which occurred after returning from
unmap_vmas() and setting MMF_OOM_SKIP).  We don't need to complain when
exit_mmap() already set MMF_OOM_SKIP.

  Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: unable to reap pid:7558 (a.out)
  a.out           D13272  7558   6931 0x00100084
  Call Trace:
   schedule+0x2d/0x80
   rwsem_down_write_failed+0x2bb/0x440
   call_rwsem_down_write_failed+0x13/0x20
   down_write+0x49/0x60
   unlink_file_vma+0x28/0x50
   free_pgtables+0x36/0x100
   exit_mmap+0xbb/0x180
   mmput+0x50/0x110
   copy_process.part.41+0xb61/0x1fe0
   _do_fork+0xe6/0x560
   do_syscall_64+0x74/0x230
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Claudio Imbrenda 77da2ba064 mm/ksm: fix interaction with THP
This patch fixes a corner case for KSM.  When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.

This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
  e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
  e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
  e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans

The expected outcome is that the all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile)

The reason of this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page.  This means that
split_huge_page will find a value of the reference counter too high and
will fail.

This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them.  If so, the
hugepage is split safely.  This means that the hugepage is not split if
not necessary.

Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Stefan Agner 644d87dccd mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
This fixes a warning shown when phys_addr_t is 32-bit int when compiling
with clang:

  mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
        to 'phys_addr_t' (aka 'unsigned int') changes value from
        18446744073709551615 to 4294967295 [-Wconstant-conversion]
                                  r->base : ULLONG_MAX;
                                            ^~~~~~~~~~
  ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
  #define ULLONG_MAX      (~0ULL)
                           ^~~~~

Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Randy Dunlap 514c603249 headers: untangle kmemleak.h from mm.h
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
reason.  It looks like it's only a convenience, so remove kmemleak.h
from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
don't already #include it.  Also remove <linux/kmemleak.h> from source
files that do not use it.

This is tested on i386 allmodconfig and x86_64 allmodconfig.  It would
be good to run it through the 0day bot for other $ARCHes.  I have
neither the horsepower nor the storage space for the other $ARCHes.

Update: This patch has been extensively build-tested by both the 0day
bot & kisskb/ozlabs build farms.  Both of them reported 2 build failures
for which patches are included here (in v2).

[ slab.h is the second most used header file after module.h; kernel.h is
  right there with slab.h. There could be some minor error in the
  counting due to some #includes having comments after them and I didn't
  combine all of those. ]

[akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>	[2 build failures]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>	[2 build failures]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Mike Kravetz 2c7452a075 mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
start_isolate_page_range() is used to set the migrate type of a set of
pageblocks to MIGRATE_ISOLATE while attempting to start a migration
operation.  It assumes that only one thread is calling it for the
specified range.  This routine is used by CMA, memory hotplug and
gigantic huge pages.  Each of these users synchronize access to the
range within their subsystem.  However, two subsystems (CMA and gigantic
huge pages for example) could attempt operations on the same range.  If
this happens, one thread may 'undo' the work another thread is doing.
This can result in pageblocks being incorrectly left marked as
MIGRATE_ISOLATE and therefore not available for page allocation.

What is ideally needed is a way to synchronize access to a set of
pageblocks that are undergoing isolation and migration.  The only thing
we know about these pageblocks is that they are all in the same zone.  A
per-node mutex is too coarse as we want to allow multiple operations on
different ranges within the same zone concurrently.  Instead, we will
use the migration type of the pageblocks themselves as a form of
synchronization.

start_isolate_page_range sets the migration type on a set of page-
blocks going in order from the one associated with the smallest pfn to
the largest pfn.  The zone lock is acquired to check and set the
migration type.  When going through the list of pageblocks check if
MIGRATE_ISOLATE is already set.  If so, this indicates another thread is
working on this pageblock.  We know exactly which pageblocks we set, so
clean up by undo those and return -EBUSY.

This allows start_isolate_page_range to serve as a synchronization
mechanism and will allow for more general use of callers making use of
these interfaces.  Update comments in alloc_contig_range to reflect this
new functionality.

Each CPU holds the associated zone lock to modify or examine the
migration type of a pageblock.  And, it will only examine/update a
single pageblock per lock acquire/release cycle.

Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
David Rientjes d46078b288 mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
Since the 2.6 kernel, the oom killer has slightly biased away from
CAP_SYS_ADMIN processes by discounting some of its memory usage in
comparison to other processes.

This has always been implicit and nothing exactly relies on the
behavior.

Gaurav notices that __task_cred() can dereference a potentially freed
pointer if the task under consideration is exiting because a reference
to the task_struct is not held.

Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.

If any CAP_SYS_ADMIN process would like to be biased against, it is
always allowed to adjust /proc/pid/oom_score_adj.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
David Rientjes 5ecd9d403a mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
Kswapd will not wakeup if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.

This can be true if there is a lot of free memory available.  For high-
order allocations, kswapd is responsible for waking up kcompactd for
background compaction.  If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.

When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim.  This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Kirill Tkhai 0c7c1bed7e mm: make counting of list_lru_one::nr_items lockless
During the reclaiming slab of a memcg, shrink_slab iterates over all
registered shrinkers in the system, and tries to count and consume
objects related to the cgroup.  In case of memory pressure, this behaves
bad: I observe high system time and time spent in list_lru_count_one()
for many processes on RHEL7 kernel.

This patch makes list_lru_node::memcg_lrus rcu protected, that allows to
skip taking spinlock in list_lru_count_one().

Shakeel Butt with the patch observes significant perf graph change.  He
says:

========================================================================
Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
VM and recording the trace with 'perf record -g -a'.

The trace without the patch:

+  34.19%     fb.sh  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  30.77%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_lock
+   3.53%     fb.sh  [kernel.kallsyms]  [k] list_lru_count_one
+   2.26%     fb.sh  [kernel.kallsyms]  [k] super_cache_count
+   1.68%     fb.sh  [kernel.kallsyms]  [k] shrink_slab
+   0.59%     fb.sh  [kernel.kallsyms]  [k] down_read_trylock
+   0.48%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.38%     fb.sh  [kernel.kallsyms]  [k] shrink_node_memcg
+   0.32%     fb.sh  [kernel.kallsyms]  [k] queue_work_on
+   0.26%     fb.sh  [kernel.kallsyms]  [k] count_shadow_nodes

With the patch:

+   0.16%     swapper  [kernel.kallsyms]    [k] default_idle
+   0.13%     oom_reaper  [kernel.kallsyms]    [k] mutex_spin_on_owner
+   0.05%     perf  [kernel.kallsyms]    [k] copy_user_generic_string
+   0.05%     init.real  [kernel.kallsyms]    [k] wait_consider_task
+   0.05%     kworker/0:0  [kernel.kallsyms]    [k] finish_task_switch
+   0.04%     kworker/2:1  [kernel.kallsyms]    [k] finish_task_switch
+   0.04%     kworker/3:1  [kernel.kallsyms]    [k] finish_task_switch
+   0.04%     kworker/1:0  [kernel.kallsyms]    [k] finish_task_switch
+   0.03%     binary  [kernel.kallsyms]    [k] copy_page
========================================================================

Thanks Shakeel for the testing.

[ktkhai@virtuozzo.com: v2]
  Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Colin Ian King f5c754d63d mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
The bool enable_vma_readahead and swap_vma_readahead() are local to the
source and do not need to be in global scope, so make them static.

Cleans up sparse warnings:

  mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
  mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Mike Rapoport e8b098fc57 mm: kernel-doc: add missing parameter descriptions
Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Mike Rapoport 002843de36 mm/swap.c: remove @cold parameter description for release_pages()
The 'cold' parameter was removed from release_pages function by commit
c6f92f9fbe ("mm: remove cold parameter for release_pages").

Update the description to match the code.

Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Mike Rapoport e48e3c590a mm/nommu: remove description of alloc_vm_area
The alloc_mm_area in nommu is a stub, but its description states it
allocates kernel address space.  Remove the description to make the code
and the documentation agree.

Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Sergey Senozhatsky 010b495e2f zsmalloc: introduce zs_huge_class_size()
Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.

ZRAM's max_zpage_size is a bad thing.  It forces zsmalloc to store
normal objects as huge ones, which results in bigger zsmalloc memory
usage.  Drop it and use actual zsmalloc huge-class value when decide if
the object is huge or not.

This patch (of 2):

Not every object can be share its zspage with other objects, e.g.  when
the object is as big as zspage or nearly as big a zspage.  For such
objects zsmalloc has a so called huge class - every object which belongs
to huge class consumes the entire zspage (which consists of a physical
page).  On x86_64, PAGE_SHIFT 12 box, the first non-huge class size is
3264, so starting down from size 3264, objects can share page(-s) and
thus minimize memory wastage.

ZRAM, however, has its own statically defined watermark for huge
objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
object larger than this watermark (3072) as a PAGE_SIZE object, in other
words, to a huge class, while zsmalloc can keep some of those objects in
non-huge classes.  This results in increased memory consumption.

zsmalloc knows better if the object is huge or not.  Introduce
zs_huge_class_size() function which tells if the given object can be
stored in one of non-huge classes or not.  This will let us to drop
ZRAM's huge object watermark and fully rely on zsmalloc when we decide
if the object is huge.

[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
  Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Huang Ying cb9f753a37 mm: fix races between swapoff and flush dcache
Thanks to commit 4b3ef9daa4 ("mm/swap: split swap cache into 64MB
trunks"), after swapoff the address_space associated with the swap
device will be freed.  So page_mapping() users which may touch the
address_space need some kind of mechanism to prevent the address_space
from being freed during accessing.

The dcache flushing functions (flush_dcache_page(), etc) in architecture
specific code may access the address_space of swap device for anonymous
pages in swap cache via page_mapping() function.  But in some cases
there are no mechanisms to prevent the swap device from being swapoff,
for example,

  CPU1					CPU2
  __get_user_pages()			swapoff()
    flush_dcache_page()
      mapping = page_mapping()
        ...				  exit_swap_address_space()
        ...				    kvfree(spaces)
        mapping_mapped(mapping)

The address space may be accessed after being freed.

But from cachetlb.txt and Russell King, flush_dcache_page() only care
about file cache pages, for anonymous pages, flush_anon_page() should be
used.  The implementation of flush_dcache_page() in all architectures
follows this too.  They will check whether page_mapping() is NULL and
whether mapping_mapped() is true to determine whether to flush the
dcache immediately.  And they will use interval tree (mapping->i_mmap)
to find all user space mappings.  While mapping_mapped() and
mapping->i_mmap isn't used by anonymous pages in swap cache at all.

So, to fix the race between swapoff and flush dcache, __page_mapping()
is add to return the address_space for file cache pages and NULL
otherwise.  All page_mapping() invoking in flush dcache functions are
replaced with page_mapping_file().

[akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dan Williams 05ea88608d mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct
When device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and report the MMU page mapping size that is being enforced by
the vma.

Similar to commit 31383c6865 "mm, hugetlbfs: introduce ->split() to
vm_operations_struct" it would be messy to teach vma_mmu_pagesize()
about device-dax page mapping sizes in the same (hstate) way that
hugetlbfs communicates this attribute.  Instead, these patches introduce
a new ->pagesize() vm operation.

Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dan Williams 09135cc594 mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize()
Patch series "mm, smaps: MMUPageSize for device-dax", v3.

Similar to commit 31383c6865 ("mm, hugetlbfs: introduce ->split() to
vm_operations_struct") here is another occasion where we want
special-case hugetlbfs/hstate enabling to also apply to device-dax.

This prompts the question what other hstate conversions we might do
beyond ->split() and ->pagesize(), but this appears to be the last of
the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.

This patch (of 3):

The current powerpc definition of vma_mmu_pagesize() open codes looking
up the page size via hstate.  It is identical to the generic
vma_kernel_pagesize() implementation.

Now, vma_kernel_pagesize() is growing support for determining the page
size of Device-DAX vmas in addition to the existing Hugetlbfs page size
determination.

Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
would automatically benefit from any new vma-type support that is added
to vma_kernel_pagesize().  However, the powerpc vma_mmu_pagesize() is
prevented from calling vma_kernel_pagesize() due to a circular header
dependency that requires vma_mmu_pagesize() to be defined before
including <linux/hugetlb.h>.

Break this circular dependency by defining the default vma_mmu_pagesize()
as a __weak symbol to be overridden by the powerpc version.

Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Mario Leinweber 2923117b71 mm/gup.c: fix coding style issues.
- Fixed style error: 8 spaces -> 1 tab.
- Fixed style warning: Corrected misleading indentation.

Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
Signed-off-by: Mario Leinweber <marioleinweber@web.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Aaron Lu 97334162e4 mm/free_pcppages_bulk: prefetch buddy while not holding lock
When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge.  This requires accessing buddy's
page structure and that access could take a long time if it's cache
cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in hope of accessing buddy's page structure later under
zone->lock will be faster.  Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e.  the prefetched data
will never be unused.

Normally, the number of prefetch will be pcp->batch(default=31 and has
an upper limit of (PAGE_SHIFT * 8)=96 on x86_64) but in the case of
pcp's pages get all drained, it will be pcp->count which has an upper
limit of pcp->high.  pcp->high, although has a default value of 186
(pcp->batch=31 * 6), can be changed by user through
/proc/sys/vm/percpu_pagelist_fraction and there is no software upper
limit so could be large, like several thousand.  For this reason, only
the first pcp->batch number of page's buddy structure is prefetched to
avoid excessive prefetching.

In the meantime, there are two concerns:

 1. the prefetch could potentially evict existing cachelines, especially
    for L1D cache since it is not huge

 2. there is some additional instruction overhead, namely calculating
    buddy pfn twice

For 1, it's hard to say, this microbenchmark though shows good result
but the actual benefit of this patch will be workload/CPU dependant;

For 2, since the calculation is a XOR on two local variables, it's
expected in many cases that cycles spent will be offset by reduced
memory latency later.  This is especially true for NUMA machines where
multiple CPUs are contending on zone->lock and the most time consuming
part under zone->lock is the wait of 'struct page' cacheline of the
to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

  kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
  v4.16-rc2+  9034215        7971818       13667135       15677465
  patch2/3    9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
  this patch 10180856 +6.8%  8506369 +2.3% 14756865 +4.9% 17325324 +3.9%

Note: this patch's performance improvement percent is against patch2/3.

(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)

[aaron.lu@intel.com: use helper function, avoid disordering pages]
  Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
  Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
[aaron.lu@intel.com: v4]
  Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
  Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Aaron Lu 0a5f4e5b45 mm/free_pcppages_bulk: do not hold lock when picking pages to free
When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, the
zone->lock is held and then pages are chosen from PCP's migratetype
list.  While there is actually no need to do this 'choose part' under
lock since it's PCP pages, the only CPU that can touch them is us and
irq is also disabled.

Moving this part outside could reduce lock held time and improve
performance.  Test with will-it-scale/page_fault1 full load:

  kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
  v4.16-rc2+  9034215        7971818       13667135       15677465
  this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%

What the test does is: starts $nr_cpu processes and each will repeatedly
do the following for 5 minutes:

 - mmap 128M anonymouse space

 - write access to that space

 - munmap.

The score is the aggregated iteration.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Aaron Lu 77ba9062e4 mm/free_pcppages_bulk: update pcp->count inside
Matthew Wilcox found that all callers of free_pcppages_bulk() currently
update pcp->count immediately after so it's natural to do it inside
free_pcppages_bulk().

No functionality or performance change is expected from this patch.

Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
David Rientjes bc3106b26c mm, compaction: drain pcps for zone when kcompactd fails
It's possible for free pages to become stranded on per-cpu pagesets
(pcps) that, if drained, could be merged with buddy pages on the zone's
free area to form large order pages, including up to MAX_ORDER.

Consider a verbose example using the tools/vm/page-types tool at the
beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
a slab page).  Pages on pcps do not have any page flags set.

  109954  1       _______S________________________________________________________
  109955  2       __________B_____________________________________________________
  109957  1       ________________________________________________________________
  109958  1       __________B_____________________________________________________
  109959  7       ________________________________________________________________
  109960  1       __________B_____________________________________________________
  109961  9       ________________________________________________________________
  10996a  1       __________B_____________________________________________________
  10996b  3       ________________________________________________________________
  10996e  1       __________B_____________________________________________________
  10996f  1       ________________________________________________________________
  ...
  109f8c  1       __________B_____________________________________________________
  109f8d  2       ________________________________________________________________
  109f8f  2       __________B_____________________________________________________
  109f91  f       ________________________________________________________________
  109fa0  1       __________B_____________________________________________________
  109fa1  7       ________________________________________________________________
  109fa8  1       __________B_____________________________________________________
  109fa9  1       ________________________________________________________________
  109faa  1       __________B_____________________________________________________
  109fab  1       _______S________________________________________________________

The compaction migration scanner is attempting to defragment this memory
since it is at the beginning of the zone.  It has done so quite well,
all movable pages have been migrated.  From pfn [0x109955, 0x109fab),
there are only buddy pages and pages without flags set.

These pages may be stranded on pcps that could otherwise allow this
memory to be coalesced if freed back to the zone free area.  It is
possible that some of these pages may not be on pcps and that something
has called alloc_pages() and used the memory directly, but we rely on
the absence of __GFP_MOVABLE in these cases to allocate from
MIGATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
pageblocks as free as possible.

These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
allow for three transparent hugepages to be dynamically allocated.
Running the numbers for all such spans on the system, it was found that
there were over 400 such spans of only buddy pages and pages without
flags set at the time this /proc/kpageflags sample was collected.
Without this support, there were _no_ order-9 or order-10 pages free.

When kcompactd fails to defragment memory such that a cc.order page can
be allocated, drain all pcps for the zone back to the buddy allocator so
this stranding cannot occur.  Compaction for that order will
subsequently be deferred, which acts as a ratelimit on this drain.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Howard McLauchlan 4f6923fbb3 mm: make should_failslab always available for fault injection
should_failslab() is a convenient function to hook into for directed
error injection into kmalloc().  However, it is only available if a
config flag is set.

The following BCC script, for example, fails kmalloc() calls after a
btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include <linux/mm.h>

    int kprobe__btrfs_close_devices(void *ctx) {
            u64 key = 1;
            flag.update(&key, &key);
            return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx) {
            u64 key = 1;
            u64 *res;
            res = flag.lookup(&key);
            if (res != 0) {
                bpf_override_return(ctx, -ENOMEM);
            }
            return 0;
    }
    """
    b = BPF(text=prog)

    while 1:
        b.kprobe_poll()

This patch refactors the should_failslab implementation so that the
function is always available for error injection, independent of flags.

This change would be similar in nature to commit f5490d3ec921 ("block:
Add should_fail_bio() for bpf error injection").

Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dou Liyang 14298d3663 mm/page_poison.c: make early_page_poison_param() __init
The early_param() is only called during kernel initialization, So Linux
marks the function of it with __init macro to save memory.

But it forgot to mark the early_page_poison_param().  So, Make it __init
as well.

Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dou Liyang 1173194e1e mm/page_owner.c: make early_page_owner_param() __init
The early_param() is only called during kernel initialization, So Linux
marks the functions of it with __init macro to save memory.

But it forgot to mark the early_page_owner_param().  So, Make it __init
as well.

Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Dou Liyang 8bd30c1090 mm/kmemleak.c: make kmemleak_boot_config() __init
The early_param() is only called during kernel initialization, So Linux
marks the functions of it with __init macro to save memory.

But it forgot to mark the kmemleak_boot_config().  So, Make it __init as
well.

Link: http://lkml.kernel.org/r/20180117034720.26897-1-douly.fnst@cn.fujitsu.com
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Minchan Kim e9e9b7ecee mm: swap: unify cluster-based and vma-based swap readahead
This patch makes do_swap_page() not need to be aware of two different
swap readahead algorithms.  Just unify cluster-based and vma-based
readahead function call.

Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Minchan Kim eaf649ebc3 mm: swap: clean up swap readahead
When I see recent change of swap readahead, I am very unhappy about
current code structure which diverges two swap readahead algorithm in
do_swap_page.  This patch is to clean it up.

Main motivation is that fault handler doesn't need to be aware of
readahead algorithms but just should call swapin_readahead.

As first step, this patch cleans up a little bit but not perfect (I just
separate for review easier) so next patch will make the goal complete.

[minchan@kernel.org: do not check readahead flag with THP anon]
  Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
  Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Tetsuo Handa e830c63a62 mm,vmscan: don't pretend forward progress upon shrinker_rwsem contention
Since we no longer use return value of shrink_slab() for normal reclaim,
the comment is no longer true.  If some do_shrink_slab() call takes
unexpectedly long (root cause of stall is currently unknown) when
register_shrinker()/unregister_shrinker() is pending, trying to drop
caches via /proc/sys/vm/drop_caches could become infinite cond_resched()
loop if many mem_cgroup are defined.  For safety, let's not pretend
forward progress.

Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Vitaly Wool 5c9bab592f z3fold: limit use of stale list for allocation
Currently if z3fold couldn't find an unbuddied page it would first try
to pull a page off the stale list.  The problem with this approach is
that we can't 100% guarantee that the page is not processed by the
workqueue thread at the same time unless we run cancel_work_sync() on
it, which we can't do if we're in an atomic context.  So let's just
limit stale list usage to non-atomic contexts only.

Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com
Signed-off-by: Vitaly Vul <vitaly.vul@sony.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <Oleksiy.Avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Konstantin Khlebnikov 605ca5ede7 mm/huge_memory.c: reorder operations in __split_huge_page_tail()
THP split makes non-atomic change of tail page flags.  This is almost ok
because tail pages are locked and isolated but this breaks recent
changes in page locking: non-atomic operation could clear bit
PG_waiters.

As a result concurrent sequence get_page_unless_zero() -> lock_page()
might block forever.  Especially if this page was truncated later.

Fix is trivial: clone flags before unfreezing page reference counter.

This race exists since commit 6290602709 ("mm: add PageWaiters
indicating tasks are waiting for a page bit") while unsave unfreeze
itself was added in commit 8df651c705 ("thp: cleanup
split_huge_page()").

clear_compound_head() also must be called before unfreezing page
reference because after successful get_page_unless_zero() might follow
put_page() which needs correct compound_head().

And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
is made especially for that and has semantic of smp_store_release().

Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Huang Ying e92bb4dd96 mm: fix races between address_space dereference and free in page_evicatable
When page_mapping() is called and the mapping is dereferenced in
page_evicatable() through shrink_active_list(), it is possible for the
inode to be truncated and the embedded address space to be freed at the
same time.  This may lead to the following race.

CPU1                                                CPU2

truncate(inode)                                     shrink_active_list()
  ...                                                 page_evictable(page)
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                                         mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                                        mapping_unevictable(mapping)
							  test_bit(AS_UNEVICTABLE, &mapping->flags);
- we've dereferenced mapping which is potentially already free.

Similar race exists between swap cache freeing and page_evicatable()
too.

The address_space in inode and swap cache will be freed after a RCU
grace period.  So the races are fixed via enclosing the page_mapping()
and address_space usage in rcu_read_lock/unlock().  Some comments are
added in code to make it clear what is protected by the RCU read lock.

Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Andy Shevchenko 5ad3509364 mm: reuse DEFINE_SHOW_ATTRIBUTE() macro
...instead of open coding file operations followed by custom ->open()
callbacks per each attribute.

[andriy.shevchenko@linux.intel.com: add tags, fix compilation issue]
  Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com
Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
David Rientjes 7f16f91fdf mm, page_alloc: move mirrored_kernelcore to __meminitdata
mirrored_kernelcore can be in __meminitdata, so move it there.

At the same time, fixup section specifiers to be after the name of the
variable per checkpatch.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121623280.179479@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
David Rientjes a5c6d65093 mm, page_alloc: extend kernelcore and movablecore for percent
Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively.  This requires
the system memory capacity to be known when specifying the command line,
however.

This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory.  This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example,
as a proportion of a system's memory rather than a hardcoded byte value.

To define the percentage, the final character of the parameter should be
a '%'.

mhocko: "why is anyone using these options nowadays?"

rientjes:
:
: Fragmentation of non-__GFP_MOVABLE pages due to low on memory
: situations can pollute most pageblocks on the system, as much as 1GB of
: slab being fragmented over 128GB of memory, for example.  When the
: amount of kernel memory is well bounded for certain systems, it is
: better to aggressively reclaim from existing MIGRATE_UNMOVABLE
: pageblocks rather than eagerly fallback to others.
:
: We have additional patches that help with this fragmentation if you're
: interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
: pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
: draining of pcp lists back to the zone free area to prevent stranding.

[rientjes@google.com: updates]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Naoya Horiguchi 31286a8484 mm: hwpoison: disable memory error handling on 1GB hugepage
Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage.  This happens because get_user_pages_fast() is not aware
of a migration entry on pud that was created in the 1st madvise() event.

I think that conversion to pud-aligned migration entry is working, but
other MM code walking over page table isn't prepared for it.  We need
some time and effort to make all this work properly, so this patch
avoids the reported bug by just disabling error handling for 1GB
hugepage.

[n-horiguchi@ah.jp.nec.com: v2]
  Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Pavel Tatashin d0dc12e86b mm/memory_hotplug: optimize memory hotplug
During memory hotplugging we traverse struct pages three times:

1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to set do: set_page_node(page, nid); and
   SetPageReserved(page);
3. loop in memmap_init_zone() to call __init_single_pfn()

This patch removes the first two loops, and leaves only loop 3.  All
struct pages are initialized in one place, the same as it is done during
boot.

The benefits:

 - We improve memory hotplug performance because we are not evicting the
   cache several times and also reduce loop branching overhead.

 - Remove condition from hotpath in __init_single_pfn(), that was added
   in order to fix the problem that was reported by Bharata in the above
   email thread, thus also improve performance during normal boot.

 - Make memory hotplug more similar to the boot memory initialization
   path because we zero and initialize struct pages only in one
   function.

 - Simplifies memory hotplug struct page initialization code, and thus
   enables future improvements, such as multi-threading the
   initialization of struct pages in order to improve hotplug
   performance even further on larger machines.

[pasha.tatashin@oracle.com: v5]
  Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Pavel Tatashin fc44f7f923 mm/memory_hotplug: don't read nid from struct page during hotplug
During memory hotplugging the probe routine will leave struct pages
uninitialized, the same as it is currently done during boot.  Therefore,
we do not want to access the inside of struct pages before
__init_single_page() is called during onlining.

Because during hotplug we know that pages in one memory block belong to
the same numa node, we can skip the checking.  We should keep checking
for the boot case.

[pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
  Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Pavel Tatashin f165b378bb mm: uninitialized struct page poisoning sanity checking
During boot we poison struct page memory in order to ensure that no one
is accessing this memory until the struct pages are initialized in
__init_single_page().

This patch adds more scrutiny to this checking by making sure that flags
do not equal the poison pattern when they are accessed.  The pattern is
all ones.

Since node id is also stored in struct page, and may be accessed quite
early, we add this enforcement into page_to_nid() function as well.
Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n

[pasha.tatashin@oracle.com: v4]
  Link: http://lkml.kernel.org/r/20180215165920.8570-4-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180213193159.14606-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Pavel Tatashin ba32558523 mm/memory_hotplug: enforce block size aligned range check
Patch series "optimize memory hotplug", v3.

This patchset:

 - Improves hotplug performance by eliminating a number of struct page
   traverses during memory hotplug.

 - Fixes some issues with hotplugging, where boundaries were not
   properly checked. And on x86 block size was not properly aligned with
   end of memory

 - Also, potentially improves boot performance by eliminating condition
   from __init_single_page().

 - Adds robustness by verifying that that struct pages are correctly
   poisoned when flags are accessed.

The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:

booting in qemu with 960G of memory, time to initialize struct pages:

no-kvm:
	TRY1		TRY2
BEFORE:	39.433668	39.39705
AFTER:	36.903781	36.989329

with-kvm:
BEFORE:	10.977447	11.103164
AFTER:	10.929072	10.751885

Hotplug 896G memory:
no-kvm:
	TRY1		TRY2
BEFORE: 848.740000	846.910000
AFTER:  783.070000	786.560000

with-kvm:
	TRY1		TRY2
BEFORE: 34.410000	33.57
AFTER:	29.810000	29.580000

This patch (of 6):

Start qemu with the following arguments:

  -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

Which: boots machine with 64G, and adds a device mem1 with 2G which can
be hotplugged later.

Also make sure that config has the following turned on:
  CONFIG_MEMORY_HOTPLUG
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
  CONFIG_ACPI_HOTPLUG_MEMORY

Using the qemu monitor hotplug the memory (make sure config has (qemu)
device_add pc-dimm,id=dimm1,memdev=mem1

The operation will fail with the following trace:

    WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
    pages_correctly_reserved+0xe6/0x110
    Modules linked in:
    CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
    BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:pages_correctly_reserved+0xe6/0x110
    Call Trace:
     memory_subsys_online+0x44/0xa0
     device_online+0x51/0x80
     store_mem_state+0x5e/0xe0
     kernfs_fop_write+0xfa/0x170
     __vfs_write+0x2e/0x150
     vfs_write+0xa8/0x1a0
     SyS_write+0x4d/0xb0
     do_syscall_64+0x5d/0x110
     entry_SYSCALL_64_after_hwframe+0x21/0x86
    ---[ end trace 6203bc4f1a5d30e8 ]---

The problem is detected in: drivers/base/memory.c

   static bool pages_correctly_reserved(unsigned long start_pfn)
   205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))

This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.

The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000

or

   $ dmesg | grep "block size"
   [    0.086469] x86/mm: Memory block size: 2048MB

During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations.  See:
Documentation/memory-hotplug.txt

So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.

In our case the start_pfn starts from the last_pfn (end of physical
memory).

   $ dmesg | grep last_pfn
   [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

0x1040000 == 65G, and so is not 2G aligned!

The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.

With this fix, running the above sequence yield to the following result:

   (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
   Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
   							size 0x80000000
   acpi PNP0C80:00: add_memory failed
   acpi PNP0C80:00: acpi_memory_enable_device() error
   acpi PNP0C80:00: Enumeration failure

Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Yang Shi f0849ac0b8 mm: thp: fix potential clearing to referenced flag in page_idle_clear_pte_refs_one()
For PTE-mapped THP, the compound THP has not been split to normal 4K
pages yet, the whole THP is considered referenced if any one of sub page
is referenced.

When walking PTE-mapped THP by pvmw, all relevant PTEs will be checked
to retrieve referenced bit.  But, the current code just returns the
result of the last PTE.  If the last PTE has not referenced, the
referenced flag will be cleared.

Just set referenced when ptep{pmdp}_clear_young_notify() returns true.

Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Pavel Tatashin c9e97a1997 mm: initialize pages on demand during boot
Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the
rest later on.

It is, however, problematic to know how many pages the kernel needs
during boot.  Different modules and kernel parameters may change the
requirement, so the boot cpu either initializes too many pages or runs
out of memory.

To fix that, initialize early pages on demand.  This ensures the kernel
does the minimum amount of work to initialize pages during boot and
leaves the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).

The on-demand code is permanently disabled using static branching once
deferred pages are initialized.  After the static branch is changed to
false, the overhead is up-to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.

Sergey Senozhatsky noticed that while deferred pages currently make
sense only on NUMA machines (we start one thread per latency node),
CONFIG_NUMA is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT,
so that is also must be addressed in the patch.

[akpm@linux-foundation.org: fix typo in comment, make deferred_pages static]
[pasha.tatashin@oracle.com: fix min() type mismatch warning]
  Link: http://lkml.kernel.org/r/20180212164543.26592-1-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: use zone_to_nid() in deferred_grow_zone()]
  Link: http://lkml.kernel.org/r/20180214163343.21234-2-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: might_sleep warning]
  Link: http://lkml.kernel.org/r/20180306192022.28289-1-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: s/spin_lock/spin_lock_irq/ in page_alloc_init_late()]
[pasha.tatashin@oracle.com: v5]
  Link: http://lkml.kernel.org/r/20180309220807.24961-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: tweak comments]
[pasha.tatashin@oracle.com: v6]
  Link: http://lkml.kernel.org/r/20180313182355.17669-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180209192216.20509-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Pavel Tatashin 3a2d7fa8a3 mm: disable interrupts while initializing deferred pages
Vlastimil Babka reported about a window issue during which when deferred
pages are initialized, and the current version of on-demand
initialization is finished, allocations may fail.  While this is highly
unlikely scenario, since this kind of allocation request must be large,
and must come from interrupt handler, we still want to cover it.

We solve this by initializing deferred pages with interrupts disabled,
and holding node_size_lock spin lock while pages in the node are being
initialized.  The on-demand deferred page initialization that comes
later will use the same lock, and thus synchronize with
deferred_init_memmap().

It is unlikely for threads that initialize deferred pages to be
interrupted.  They run soon after smp_init(), but before modules are
initialized, and long before user space programs.  This is why there is
no adverse effect of having these threads running with interrupts
disabled.

[pasha.tatashin@oracle.com: v6]
  Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Randy Dunlap 8e7a0c9100 mm/swap_slots.c: use conditional compilation
For mm/swap_slots.c, use the traditional Linux method of conditional
compilation and linking instead of always compiling it by using #ifdef
CONFIG_SWAP and #endif for the entire source file (excluding header
files).

Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Anshuman Khandual 310253514b mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE
alloc_contig_range() initiates compaction and eventual migration for the
purpose of either CMA or HugeTLB allocations.  At present, the reason
code remains the same MR_CMA for either of these cases.  Let's make it
MR_CONTIG_RANGE which will appropriately reflect the reason code in both
these cases.

Link: http://lkml.kernel.org/r/20180202091518.18798-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
David Woodhouse 57a7702b12 mm: always print RLIMIT_DATA warning
The documentation for ignore_rlimit_data says that it will print a
warning at first misuse.  Yet it doesn't seem to do that.

Fix the code to print the warning even when we allow the process to
continue.

Link: http://lkml.kernel.org/r/1517935505-9321-1-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Colin Ian King c01f0b54ef mm/ksm.c: make stable_node_dup() static
stable_node_dup() is local to the source and does not need to be in
global scope, so make it static.

Cleans up sparse warning:

  mm/ksm.c:1321:13: warning: symbol 'stable_node_dup' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180206221005.12642-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Shakeel Butt f9e13c0a5a slab, slub: skip unnecessary kasan_cache_shutdown()
The kasan quarantine is designed to delay freeing slab objects to catch
use-after-free.  The quarantine can be large (several percent of machine
memory size).  When kmem_caches are deleted related objects are flushed
from the quarantine but this requires scanning the entire quarantine
which can be very slow.  We have seen the kernel busily working on this
while holding slab_mutex and badly affecting cache_reaper, slabinfo
readers and memcg kmem cache creations.

It can easily reproduced by following script:

	yes . | head -1000000 | xargs stat > /dev/null
	for i in `seq 1 10`; do
		seq 500 | (cd /cg/memory && xargs mkdir)
		seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
			/cg/memory/{}/tasks && exec stat .' > /dev/null
		seq 500 | (cd /cg/memory && xargs rmdir)
	done

The busy stack:
    kasan_cache_shutdown
    shutdown_cache
    memcg_destroy_kmem_caches
    mem_cgroup_css_free
    css_free_rwork_fn
    process_one_work
    worker_thread
    kthread
    ret_from_fork

This patch is based on the observation that if the kmem_cache to be
destroyed is empty then there should not be any objects of this cache in
the quarantine.

Without the patch the script got stuck for couple of hours.  With the
patch the script completed within a second.

Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Mikulas Patocka 1ba586de22 mm/slab_common.c: remove test if cache name is accessible
Since commit db265eca77 ("mm/sl[aou]b: Move duping of slab name to
slab_common.c"), the kernel always duplicates the slab cache name when
creating a slab cache, so the test if the slab name is accessible is
useless.

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Shakeel Butt 613a5eb567 slab, slub: remove size disparity on debug kernel
I have noticed on debug kernel with SLAB, the size of some non-root
slabs were larger than their corresponding root slabs.

e.g. for radix_tree_node:
  $cat /proc/slabinfo | grep radix
  name     <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
  radix_tree_node 15052    15075      4096         1             1 ...

  $cat /cgroup/memory/temp/memory.kmem.slabinfo | grep radix
  name     <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
  radix_tree_node 1581      158       4120         1             2 ...

However for SLUB in debug kernel, the sizes were same.  On further
inspection it is found that SLUB always use kmem_cache.object_size to
measure the kmem_cache.size while SLAB use the given kmem_cache.size.
In the debug kernel the slab's size can be larger than its object_size.
Thus in the creation of non-root slab, the SLAB uses the root's size as
base to calculate the non-root slab's size and thus non-root slab's size
can be larger than the root slab's size.  For SLUB, the non-root slab's
size is measured based on the root's object_size and thus the size will
remain same for root and non-root slab.

This patch makes slab's object_size the default base to measure the
slab's size.

Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com
Fixes: 794b1248be ("memcg, slab: separate memcg vs root cache creation paths")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 302d55d51d slab: use 32-bit arithmetic in freelist_randomize()
SLAB doesn't support 4GB+ of objects per slab, therefore randomization
doesn't need size_t.

Link: http://lkml.kernel.org/r/20180305200730.15812-25-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 870b1fbb03 slub: make size_from_object() return unsigned int
Function returns size of the object without red zone which can't be
negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-24-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 19af27aff9 slub: make struct kmem_cache_order_objects::x unsigned int
struct kmem_cache_order_objects is for mixing order and number of
objects, and orders aren't big enough to warrant 64-bit width.

Propagate unsignedness down so that everything fits.

!!! Patch assumes that "PAGE_SIZE << order" doesn't overflow. !!!

Link: http://lkml.kernel.org/r/20180305200730.15812-23-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 284b50ddcf slub: make slab_index() return unsigned int
slab_index() returns index of an object within a slab which is at most
u15 (or u16?).

Iterators additionally guarantee that "p >= addr".

Link: http://lkml.kernel.org/r/20180305200730.15812-22-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 7bbdb81ee3 slab: make usercopy region 32-bit
If kmem case sizes are 32-bit, then usecopy region should be too.

Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan be4a7988b3 kasan: make kasan_cache_create() work with 32-bit slab cache sizes
If SLAB doesn't support 4GB+ kmem caches (it never did), KASAN should
not do it as well.

Link: http://lkml.kernel.org/r/20180305200730.15812-20-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 0293d1fdd6 slab: make kmem_cache_flags accept 32-bit object size
Now that all sizes are properly typed, propagate "unsigned int" down the
callgraph.

Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 44065b2e29 slub: make ->size unsigned int
Linux doesn't support negative length objects (including meta data).

Link: http://lkml.kernel.org/r/20180305200730.15812-18-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan 1b473f29d5 slub: make ->object_size unsigned int
Linux doesn't support negative length objects.

Link: http://lkml.kernel.org/r/20180305200730.15812-17-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Alexey Dobriyan e5d9998f3e slub: make ->cpu_partial unsigned int
/*
	 * cpu_partial determined the maximum number of objects
	 * kept in the per cpu partial lists of a processor.
	 */

Can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan 52ee6d74aa slub: make ->inuse unsigned int
->inuse is "the number of bytes in actual use by the object",
can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-14-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan 3a3791ec2e slub: make ->align unsigned int
Kmem cache alignment can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-13-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan d66e52d1e8 slub: make ->reserved unsigned int
->reserved is either 0 or sizeof(struct rcu_head), can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-12-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan eb7235eb84 slub: make ->remote_node_defrag_ratio unsigned int
->remote_node_defrag_ratio is in range 0..1000.

This also adds a check and modifies the behavior to return an error
code.  Before this patch invalid values were ignored.

Link: http://lkml.kernel.org/r/20180305200730.15812-9-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan ac914d08bb slab: make size_index_elem() unsigned int
size_index_elem() always works with small sizes (kmalloc caches are
32-bit) and returns small indexes.

Link: http://lkml.kernel.org/r/20180305200730.15812-8-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan d5f866550d slab: make size_index[] array u8
All those small numbers are reverse indexes into kmalloc caches array
and can't be negative.

On x86_64 "unsigned int = fls()" can drop CDQE instruction:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-2 (-2)
	Function                                     old     new   delta
	kmalloc_slab                                 101      99      -2

Link: http://lkml.kernel.org/r/20180305200730.15812-7-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan f4957d5bd0 slab: make kmem_cache_create() work with 32-bit sizes
struct kmem_cache::size and ::align were always 32-bit.

Out of curiosity I created 4GB kmem_cache, it oopsed with division by 0.
kmem_cache_create(1UL<<32+1) created 1-byte cache as expected.

size_t doesn't work and never did.

Link: http://lkml.kernel.org/r/20180305200730.15812-6-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan 361d575e5c slab: make create_boot_cache() work with 32-bit sizes
struct kmem_cache::size has always been "int", all those
"size_t size" are fake.

Link: http://lkml.kernel.org/r/20180305200730.15812-5-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan 55de8b9c60 slab: make create_kmalloc_cache() work with 32-bit sizes
KMALLOC_MAX_CACHE_SIZE is 32-bit so is the largest kmalloc cache size.

Christoph said:
:
: Ok SLABs maximum allocation size is limited to 32M (see
: include/linux/slab.h:
:
: #define KMALLOC_SHIFT_HIGH      ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
:                                 (MAX_ORDER + PAGE_SHIFT - 1) : 25)
:
: And SLUB/SLOB pass all larger requests to the page allocator anyways.

Link: http://lkml.kernel.org/r/20180305200730.15812-4-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan 0be70327ec slab: make kmalloc_size() return "unsigned int"
kmalloc_size() derives size of kmalloc cache from internal index, which
can't be negative.

Propagate unsignedness a bit.

Link: http://lkml.kernel.org/r/20180305200730.15812-3-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan c86305743b slab: fixup calculate_alignment() argument type
Link: http://lkml.kernel.org/r/20180305200730.15812-1-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Chintan Pandya 86609d3319 mm/slub.c: use jitter-free reference while printing age
When SLUB_DEBUG catches some issues, it prints all the required debug
info.  However, in a few cases where allocation and free of the object
has happened in a very short time, 'age' might be misleading.  See the
example below:

  =============================================================================
  BUG kmalloc-256 (Tainted: G        W  O   ): Poison overwritten
  -----------------------------------------------------------------------------
  ...
  INFO: Allocated in binder_transaction+0x4b0/0x2448 age=731 cpu=3 pid=5314
  ...
  INFO: Freed in binder_free_transaction+0x2c/0x58 age=735 cpu=6 pid=2079
  ...
  Object fffffff14956a870: 6b 6b 6b 6b 6b 6b 6b 6b 67 6b 6b 6b 6b 6b 6b a5  kkkkkkkkgkkkk

In this case, object got freed later but 'age' shows otherwise.  This
could be because, while printing this info, we print allocation traces
first and free traces thereafter.  In between, if we get schedule out or
jiffies increment, (jiffies - t->when) could become meaningless.

Use the jitter free reference to calculate age.

New output will exactly be same.  'age' is still staying with single
jiffies ref in both prints.

Change-Id: I0846565807a4229748649bbecb1ffb743d71fcd8
Link: http://lkml.kernel.org/r/1520492010-19389-1-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Alexey Dobriyan 1c99ba2918 mm/slab_common.c: mark kmalloc machinery as __ro_after_init
kmalloc caches aren't relocated after being set up neither does
"size_index" array.

Link: http://lkml.kernel.org/r/20180226203519.GA6886@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00