When scanning for sources or targets, PageCompound is checked for huge
pages as they can be skipped quickly but it happens relatively late
after a lot of setup and checking. This patch short-cuts the check to
make it earlier. It might still change when the lock is acquired but
this has less overhead overall. The free scanner advances but the
migration scanner does not. Typically the free scanner encounters more
movable blocks that change state over the lifetime of the system and
also tends to scan more aggressively as it's actively filling its
portion of the physical address space with data. This could change in
the future but for the moment, this worked better in practice and
incurred fewer scan restarts.
The impact on latency and allocation success rates is marginal but the
free scan rates are reduced by 15% and system CPU usage is reduced by
3.3%. The 2-socket results are not materially different.
Link: http://lkml.kernel.org/r/20190118175136.31341-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Async migration aborts on spinlock contention but contention can be high
when there are multiple compaction attempts and kswapd is active. The
consequence is that the migration scanners move forward uselessly while
still contending on locks for longer while leaving suitable migration
sources behind.
This patch will acquire the lock but track when contention occurs. When
it does, the current pageblock will finish as compaction may succeed for
that block and then abort. This will have a variable impact on latency
as in some cases useless scanning is avoided (reduces latency) but a
lock will be contended (increase latency) or a single contended
pageblock is scanned that would otherwise have been skipped (increase
latency).
5.0.0-rc1 5.0.0-rc1
norescan-v3r16 finishcontend-v3r16
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 3002.07 ( 0.00%) 3153.17 ( -5.03%)
Amean fault-both-5 4684.47 ( 0.00%) 4280.52 ( 8.62%)
Amean fault-both-7 6815.54 ( 0.00%) 5811.50 * 14.73%*
Amean fault-both-12 10864.02 ( 0.00%) 9276.85 ( 14.61%)
Amean fault-both-18 12247.52 ( 0.00%) 11032.67 ( 9.92%)
Amean fault-both-24 15683.99 ( 0.00%) 14285.70 ( 8.92%)
Amean fault-both-30 18620.02 ( 0.00%) 16293.76 * 12.49%*
Amean fault-both-32 19250.28 ( 0.00%) 16721.02 * 13.14%*
5.0.0-rc1 5.0.0-rc1
norescan-v3r16 finishcontend-v3r16
Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Percentage huge-3 95.00 ( 0.00%) 96.82 ( 1.92%)
Percentage huge-5 94.22 ( 0.00%) 95.40 ( 1.26%)
Percentage huge-7 92.35 ( 0.00%) 95.92 ( 3.86%)
Percentage huge-12 91.90 ( 0.00%) 96.73 ( 5.25%)
Percentage huge-18 89.58 ( 0.00%) 96.77 ( 8.03%)
Percentage huge-24 90.03 ( 0.00%) 96.05 ( 6.69%)
Percentage huge-30 89.14 ( 0.00%) 96.81 ( 8.60%)
Percentage huge-32 90.58 ( 0.00%) 97.41 ( 7.54%)
There is a variable impact that is mostly good on latency while allocation
success rates are slightly higher. System CPU usage is reduced by about
10% but scan rate impact is mixed
Compaction migrate scanned 27997659.00 20148867
Compaction free scanned 120782791.00 118324914
Migration scan rates are reduced 28% which is expected as a pageblock is
used by the async scanner instead of skipped. The impact on the free
scanner is known to be variable. Overall the primary justification for
this patch is that completing scanning of a pageblock is very important
for later patches.
[yuehaibing@huawei.com: fix unused variable warning]
Link: http://lkml.kernel.org/r/20190118175136.31341-14-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pageblocks are marked for skip when no pages are isolated after a scan.
However, it's possible to hit corner cases where the migration scanner
gets stuck near the boundary between the source and target scanner. Due
to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
migrated can be reallocated before the pageblock is complete. The
pageblock is not necessarily skipped so it can be rescanned multiple
times. Similarly, a pageblock with some dirty/writeback pages may fail
to migrate and be rescanned until writeback completes which is wasteful.
This patch tracks if a pageblock is being rescanned. If so, then the
entire pageblock will be migrated as one operation. This narrows the
race window during which pages can be reallocated during migration.
Secondly, if there are pages that cannot be isolated then the pageblock
will still be fully scanned and marked for skipping. On the second
rescan, the pageblock skip is set and the migration scanner makes
progress.
5.0.0-rc1 5.0.0-rc1
findfree-v3r16 norescan-v3r16
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 3200.68 ( 0.00%) 3002.07 ( 6.21%)
Amean fault-both-5 4847.75 ( 0.00%) 4684.47 ( 3.37%)
Amean fault-both-7 6658.92 ( 0.00%) 6815.54 ( -2.35%)
Amean fault-both-12 11077.62 ( 0.00%) 10864.02 ( 1.93%)
Amean fault-both-18 12403.97 ( 0.00%) 12247.52 ( 1.26%)
Amean fault-both-24 15607.10 ( 0.00%) 15683.99 ( -0.49%)
Amean fault-both-30 18752.27 ( 0.00%) 18620.02 ( 0.71%)
Amean fault-both-32 21207.54 ( 0.00%) 19250.28 * 9.23%*
5.0.0-rc1 5.0.0-rc1
findfree-v3r16 norescan-v3r16
Percentage huge-3 96.86 ( 0.00%) 95.00 ( -1.91%)
Percentage huge-5 93.72 ( 0.00%) 94.22 ( 0.53%)
Percentage huge-7 94.31 ( 0.00%) 92.35 ( -2.08%)
Percentage huge-12 92.66 ( 0.00%) 91.90 ( -0.82%)
Percentage huge-18 91.51 ( 0.00%) 89.58 ( -2.11%)
Percentage huge-24 90.50 ( 0.00%) 90.03 ( -0.52%)
Percentage huge-30 91.57 ( 0.00%) 89.14 ( -2.65%)
Percentage huge-32 91.00 ( 0.00%) 90.58 ( -0.46%)
Negligible difference but this was likely a case when the specific
corner case was not hit. A previous run of the same patch based on an
earlier iteration of the series showed large differences where migration
rates could be halved when the corner case was hit.
The specific corner case where migration scan rates go through the roof
was due to a dirty/writeback pageblock located at the boundary of the
migration/free scanner did not happen in this case. When it does
happen, the scan rates multipled by massive margins.
Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to the migration scanner, this patch uses the free lists to
quickly locate a migration target. The search is different in that
lower orders will be searched for a suitable high PFN if necessary but
the search is still bound. This is justified on the grounds that the
free scanner typically scans linearly much more than the migration
scanner.
If a free page is found, it is isolated and compaction continues if
enough pages were isolated. For SYNC* scanning, the full pageblock is
scanned for any remaining free pages so that is can be marked for
skipping in the near future.
1-socket thpfioscale
5.0.0-rc1 5.0.0-rc1
isolmig-v3r15 findfree-v3r16
Amean fault-both-3 3024.41 ( 0.00%) 3200.68 ( -5.83%)
Amean fault-both-5 4749.30 ( 0.00%) 4847.75 ( -2.07%)
Amean fault-both-7 6454.95 ( 0.00%) 6658.92 ( -3.16%)
Amean fault-both-12 10324.83 ( 0.00%) 11077.62 ( -7.29%)
Amean fault-both-18 12896.82 ( 0.00%) 12403.97 ( 3.82%)
Amean fault-both-24 13470.60 ( 0.00%) 15607.10 * -15.86%*
Amean fault-both-30 17143.99 ( 0.00%) 18752.27 ( -9.38%)
Amean fault-both-32 17743.91 ( 0.00%) 21207.54 * -19.52%*
The impact on latency is variable but the search is optimistic and
sensitive to the exact system state. Success rates are similar but the
major impact is to the rate of scanning
5.0.0-rc1 5.0.0-rc1
isolmig-v3r15 findfree-v3r16
Compaction migrate scanned 25646769 29507205
Compaction free scanned 201558184 100359571
The free scan rates are reduced by 50%. The 2-socket reductions for the
free scanner are more dramatic which is a likely reflection that the
machine has more memory.
[dan.carpenter@oracle.com: fix static checker warning]
[vbabka@suse.cz: correct number of pages scanned for lower orders]
Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to either a fast search of the free list or a linear scan, it is
possible for multiple compaction instances to pick the same pageblock
for migration. This is lucky for one scanner and increased scanning for
all the others. It also allows a race between requests on which first
allocates the resulting free block.
This patch tests and updates the pageblock skip for the migration
scanner carefully. When isolating a block, it will check and skip if
the block is already in use. Once the zone lock is acquired, it will be
rechecked so that only one scanner can set the pageblock skip for
exclusive use. Any scanner contending will continue with a linear scan.
The skip bit is still set if no pages can be isolated in a range. While
this may result in redundant scanning, it avoids unnecessarily acquiring
the zone lock when there are no suitable migration sources.
1-socket thpscale
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 3390.40 ( 0.00%) 3024.41 ( 10.80%)
Amean fault-both-5 5082.28 ( 0.00%) 4749.30 ( 6.55%)
Amean fault-both-7 7012.51 ( 0.00%) 6454.95 ( 7.95%)
Amean fault-both-12 11346.63 ( 0.00%) 10324.83 ( 9.01%)
Amean fault-both-18 15324.19 ( 0.00%) 12896.82 * 15.84%*
Amean fault-both-24 16088.50 ( 0.00%) 13470.60 * 16.27%*
Amean fault-both-30 18723.42 ( 0.00%) 17143.99 ( 8.44%)
Amean fault-both-32 18612.01 ( 0.00%) 17743.91 ( 4.66%)
5.0.0-rc1 5.0.0-rc1
findmig-v3r15 isolmig-v3r15
Percentage huge-3 89.83 ( 0.00%) 92.96 ( 3.48%)
Percentage huge-5 91.96 ( 0.00%) 93.26 ( 1.41%)
Percentage huge-7 92.85 ( 0.00%) 93.63 ( 0.84%)
Percentage huge-12 92.74 ( 0.00%) 92.80 ( 0.07%)
Percentage huge-18 91.71 ( 0.00%) 91.62 ( -0.10%)
Percentage huge-24 92.13 ( 0.00%) 91.50 ( -0.69%)
Percentage huge-30 93.79 ( 0.00%) 92.73 ( -1.13%)
Percentage huge-32 91.27 ( 0.00%) 91.94 ( 0.74%)
This shows a reasonable reduction in latency as multiple compaction
scanners do not operate on the same blocks with a similar allocation
success rate.
Compaction migrate scanned 41093126 25646769
Migration scan rates are reduced by 38%.
Link: http://lkml.kernel.org/r/20190118175136.31341-11-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The migration scanner is a linear scan of a zone with a potentiall large
search space. Furthermore, many pageblocks are unusable such as those
filled with reserved pages or partially filled with pages that cannot
migrate. These still get scanned in the common case of allocating a THP
and the cost accumulates.
The patch uses a partial search of the free lists to locate a migration
source candidate that is marked as MOVABLE when allocating a THP. It
prefers picking a block with a larger number of free pages already on
the basis that there are fewer pages to migrate to free the entire
block. The lowest PFN found during searches is tracked as the basis of
the start for the linear search after the first search of the free list
fails. After the search, the free list is shuffled so that the next
search will not encounter the same page. If the search fails then the
subsequent searches will be shorter and the linear scanner is used.
If this search fails, or if the request is for a small or
unmovable/reclaimable allocation then the linear scanner is still used.
It is somewhat pointless to use the list search in those cases. Small
free pages must be used for the search and there is no guarantee that
movable pages are located within that block that are contiguous.
5.0.0-rc1 5.0.0-rc1
noboost-v3r10 findmig-v3r15
Amean fault-both-3 3771.41 ( 0.00%) 3390.40 ( 10.10%)
Amean fault-both-5 5409.05 ( 0.00%) 5082.28 ( 6.04%)
Amean fault-both-7 7040.74 ( 0.00%) 7012.51 ( 0.40%)
Amean fault-both-12 11887.35 ( 0.00%) 11346.63 ( 4.55%)
Amean fault-both-18 16718.19 ( 0.00%) 15324.19 ( 8.34%)
Amean fault-both-24 21157.19 ( 0.00%) 16088.50 * 23.96%*
Amean fault-both-30 21175.92 ( 0.00%) 18723.42 * 11.58%*
Amean fault-both-32 21339.03 ( 0.00%) 18612.01 * 12.78%*
5.0.0-rc1 5.0.0-rc1
noboost-v3r10 findmig-v3r15
Percentage huge-3 86.50 ( 0.00%) 89.83 ( 3.85%)
Percentage huge-5 92.52 ( 0.00%) 91.96 ( -0.61%)
Percentage huge-7 92.44 ( 0.00%) 92.85 ( 0.44%)
Percentage huge-12 92.98 ( 0.00%) 92.74 ( -0.25%)
Percentage huge-18 91.70 ( 0.00%) 91.71 ( 0.02%)
Percentage huge-24 91.59 ( 0.00%) 92.13 ( 0.60%)
Percentage huge-30 90.14 ( 0.00%) 93.79 ( 4.04%)
Percentage huge-32 90.03 ( 0.00%) 91.27 ( 1.37%)
This shows an improvement in allocation latencies with similar
allocation success rates. While not presented, there was a 31%
reduction in migration scanning and a 8% reduction on system CPU usage.
A 2-socket machine showed similar benefits.
[mgorman@techsingularity.net: several fixes]
Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
[vbabka@suse.cz: migrate block that was found-fast, some optimisations]
Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <Vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When pageblocks get fragmented, watermarks are artifically boosted to
reclaim pages to avoid further fragmentation events. However,
compaction is often either fragmentation-neutral or moving movable pages
away from unmovable/reclaimable pages. As the true watermarks are
preserved, allow compaction to ignore the boost factor.
The expected impact is very slight as the main benefit is that
compaction is slightly more likely to succeed when the system has been
fragmented very recently. On both 1-socket and 2-socket machines for
THP-intensive allocation during fragmentation the success rate was
increased by less than 1% which is marginal. However, detailed tracing
indicated that failure of migration due to a premature ENOMEM triggered
by watermark checks were eliminated.
Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When compaction is finishing, it uses a flag to ensure the pageblock is
complete but it makes sense to always complete migration of a pageblock.
Minimally, skip information is based on a pageblock and partially
scanned pageblocks may incur more scanning in the future. The pageblock
skip handling also becomes more strict later in the series and the hint
is more useful if a complete pageblock was always scanned.
The potentially impacts latency as more scanning is done but it's not a
consistent win or loss as the scanning is not always a high percentage
of the pageblock and sometimes it is offset by future reductions in
scanning. Hence, the results are not presented this time due to a
misleading mix of gains/losses without any clear pattern. However, full
scanning of the pageblock is important for later patches.
Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pages with no migration handler use a fallback handler which sometimes
works and sometimes persistently retries. A historical example was
blockdev pages but there are others such as odd refcounting when
page->private is used. These are retried multiple times which is
wasteful during compaction so this patch will fail migration faster
unless the caller specifies MIGRATE_SYNC.
This is not expected to help THP allocation success rates but it did
reduce latencies very slightly in some cases.
1-socket thpfioscale
4.20.0 4.20.0
noreserved-v2r15 failfast-v2r15
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 3839.67 ( 0.00%) 3833.72 ( 0.15%)
Amean fault-both-5 5177.47 ( 0.00%) 4967.15 ( 4.06%)
Amean fault-both-7 7245.03 ( 0.00%) 7139.19 ( 1.46%)
Amean fault-both-12 11534.89 ( 0.00%) 11326.30 ( 1.81%)
Amean fault-both-18 16241.10 ( 0.00%) 16270.70 ( -0.18%)
Amean fault-both-24 19075.91 ( 0.00%) 19839.65 ( -4.00%)
Amean fault-both-30 22712.11 ( 0.00%) 21707.05 ( 4.43%)
Amean fault-both-32 21692.92 ( 0.00%) 21968.16 ( -1.27%)
The 2-socket results are not materially different. Scan rates are
similar as expected.
Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's non-obvious that high-order free pages are split into order-0 pages
from the function name. Fix it.
Link: http://lkml.kernel.org/r/20190118175136.31341-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A zone parameter is passed into a number of top-level compaction
functions despite the fact that it's already in compact_control. This
is harmless but it did need an audit to check if zone actually ever
changes meaningfully. This patches removes the parameter in a number of
top-level functions. The change could be much deeper but this was
enough to briefly clarify the flow.
No functional change.
Link: http://lkml.kernel.org/r/20190118175136.31341-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last_migrated_pfn field is a bit dubious as to whether it really
helps but either way, the information from it can be inferred without
increasing the size of compact_control so remove the field.
Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
compact_control spans two cache lines with write-intensive lines on
both. Rearrange so the most write-intensive fields are in the same
cache line. This has a negligible impact on the overall performance of
compaction and is more a tidying exercise than anything.
Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Increase success rates and reduce latency of compaction", v3.
This series reduces scan rates and success rates of compaction,
primarily by using the free lists to shorten scans, better controlling
of skip information and whether multiple scanners can target the same
block and capturing pageblocks before being stolen by parallel requests.
The series is based on mmotm from January 9th, 2019 with the previous
compaction series reverted.
I'm mostly using thpscale to measure the impact of the series. The
benchmark creates a large file, maps it, faults it, punches holes in the
mapping so that the virtual address space is fragmented and then tries
to allocate THP. It re-executes for different numbers of threads. From
a fragmentation perspective, the workload is relatively benign but it
does stress compaction.
The overall impact on latencies for a 1-socket machine is
baseline patches
Amean fault-both-3 3832.09 ( 0.00%) 2748.56 * 28.28%*
Amean fault-both-5 4933.06 ( 0.00%) 4255.52 ( 13.73%)
Amean fault-both-7 7017.75 ( 0.00%) 6586.93 ( 6.14%)
Amean fault-both-12 11610.51 ( 0.00%) 9162.34 * 21.09%*
Amean fault-both-18 17055.85 ( 0.00%) 11530.06 * 32.40%*
Amean fault-both-24 19306.27 ( 0.00%) 17956.13 ( 6.99%)
Amean fault-both-30 22516.49 ( 0.00%) 15686.47 * 30.33%*
Amean fault-both-32 23442.93 ( 0.00%) 16564.83 * 29.34%*
The allocation success rates are much improved
baseline patches
Percentage huge-3 85.99 ( 0.00%) 97.96 ( 13.92%)
Percentage huge-5 88.27 ( 0.00%) 96.87 ( 9.74%)
Percentage huge-7 85.87 ( 0.00%) 94.53 ( 10.09%)
Percentage huge-12 82.38 ( 0.00%) 98.44 ( 19.49%)
Percentage huge-18 83.29 ( 0.00%) 99.14 ( 19.04%)
Percentage huge-24 81.41 ( 0.00%) 97.35 ( 19.57%)
Percentage huge-30 80.98 ( 0.00%) 98.05 ( 21.08%)
Percentage huge-32 80.53 ( 0.00%) 97.06 ( 20.53%)
That's a nearly perfect allocation success rate.
The biggest impact is on the scan rates
Compaction migrate scanned 55893379 19341254
Compaction free scanned 474739990 11903963
The number of pages scanned for migration was reduced by 65% and the
free scanner was reduced by 97.5%. So much less work in exchange for
lower latency and better success rates.
The series was also evaluated using a workload that heavily fragments
memory but the benefits there are also significant, albeit not
presented.
It was commented that we should be rethinking scanning entirely and to a
large extent I agree. However, to achieve that you need a lot of this
series in place first so it's best to make the linear scanners as best
as possible before ripping them out.
This patch (of 22):
The isolate and migrate scanners should never isolate more than a
pageblock of pages so unsigned int is sufficient saving 8 bytes on a
64-bit build.
Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'end_byte' parameter of filemap_range_has_page is required to be
inclusive, so follow the rule.
Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
Fixes: 6be96d3ad3 ("fs: return if direct I/O will trigger writeback")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: zhangyi (F) <yi.zhang@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap_vma_readahead()'s comment is missing, just add it.
Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Hugh Dickins <hughd@google.com
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap readahead would read in a few pages regardless if the underlying
device is busy or not. It may incur long waiting time if the device is
congested, and it may also exacerbate the congestion.
Use inode_read_congested() to check if the underlying device is busy or
not like what file page readahead does. Get inode from
swap_info_struct.
Although we can add inode information in swap_address_space
(address_space->host), it may lead some unexpected side effect, i.e. it
may break mapping_cap_account_dirty(). Using inode from
swap_info_struct seems simple and good enough.
Just does the check in vma_cluster_readahead() since
swap_vma_readahead() is just used for non-rotational device which much
less likely has congestion than traditional HDD.
Although swap slots may be consecutive on swap partition, it still may
be fragmented on swap file. This check would help to reduce excessive
stall for such case.
The test with page_fault1 of will-it-scale (sometimes tracing may just
show runtest.py that is the wrapper script of page_fault1), which
basically launches NR_CPU threads to generate 128MB anonymous pages for
each thread, on my virtual machine with congested HDD shows long tail
latency is reduced significantly.
Without the patch
page_fault1_thr-1490 [023] 129.311706: funcgraph_entry: #57377.796 us | do_swap_page();
page_fault1_thr-1490 [023] 129.369103: funcgraph_entry: 5.642us | do_swap_page();
page_fault1_thr-1490 [023] 129.369119: funcgraph_entry: #1289.592 us | do_swap_page();
page_fault1_thr-1490 [023] 129.370411: funcgraph_entry: 4.957us | do_swap_page();
page_fault1_thr-1490 [023] 129.370419: funcgraph_entry: 1.940us | do_swap_page();
page_fault1_thr-1490 [023] 129.378847: funcgraph_entry: #1411.385 us | do_swap_page();
page_fault1_thr-1490 [023] 129.380262: funcgraph_entry: 3.916us | do_swap_page();
page_fault1_thr-1490 [023] 129.380275: funcgraph_entry: #4287.751 us | do_swap_page();
With the patch
runtest.py-1417 [020] 301.925911: funcgraph_entry: #9870.146 us | do_swap_page();
runtest.py-1417 [020] 301.935785: funcgraph_entry: 9.802us | do_swap_page();
runtest.py-1417 [020] 301.935799: funcgraph_entry: 3.551us | do_swap_page();
runtest.py-1417 [020] 301.935806: funcgraph_entry: 2.142us | do_swap_page();
runtest.py-1417 [020] 301.935853: funcgraph_entry: 6.938us | do_swap_page();
runtest.py-1417 [020] 301.935864: funcgraph_entry: 3.765us | do_swap_page();
runtest.py-1417 [020] 301.935871: funcgraph_entry: 3.600us | do_swap_page();
runtest.py-1417 [020] 301.935878: funcgraph_entry: 7.202us | do_swap_page();
[akpm@linux-foundation.org: code cleanup]
[yang.shi@linux.alibaba.com: add comment]
Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Hugh Dickins <hughd@google.com
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After we establish a reference on the page, we check the pointer
continues to be in the correct position in i_pages. Checking
page->index afterwards is unnecessary; if it were to change, then the
pointer to it from the page cache would also move. The check used to be
done before grabbing a reference on the page which was racy (see commit
9cbb4cb21b ("mm: find_get_pages_contig fixlet")), but nobody noticed
that moving the check after grabbing the reference was redundant.
Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the current implementation, there are two places to isolate a range
of page: __offline_pages() and alloc_contig_range(). During this
procedure, it will drain pages on pcp list.
Below is a brief call flow:
__offline_pages()/alloc_contig_range()
start_isolate_page_range()
set_migratetype_isolate()
drain_all_pages()
drain_all_pages() <--- A
This snippet shows the current logic is isolate and drain pcp list for
each pageblock and drain pcp list again for the whole range.
start_isolate_page_range is responsible for isolating the given pfn
range. One part of that job is to make sure that also pages that are on
the allocator pcp lists are properly isolated. Otherwise they could be
reused and the range wouldn't be completely isolated until the memory is
freed back. While there is no strict guarantee here because pages might
get allocated at any time before drain_all_pages is called there doesn't
seem to be any strong demand for such a guarantee.
In any case, draining is already done at the isolation level and there
is no need to do it again later by start_isolate_page_range callers
(memory hotplug and CMA allocator currently). Therefore remove
pointless draining in existing callers to make the code more clear and
functionally correct.
[mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "arm64/mm: Enable HugeTLB migration", v4.
This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
Following HugeTLB migration support matrix has been enabled with this
patch series. All permutations have been tested except for the 16GB.
CONT PTE PMD CONT PMD PUD
-------- --- -------- ---
4K: 64K 2M 32M 1G
16K: 2M 32M 1G
64K: 2M 512M 16G
First the series adds migration support for PUD based huge pages. It
then adds a platform specific hook to query an architecture if a given
huge page size is supported for migration while also providing a default
fallback option preserving the existing semantics which just checks for
(PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB
migration on arm64 and subscribe to this new platform specific hook by
defining an override.
The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.
This patch (of 5):
During huge page allocation it's migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other factors
than just migratability. Movability in itself is a distinct property
which should not be tied with migratability alone.
This differentiates these two and implements an enhanced movability check
which also considers huge page size to determine if it is feasible to be
placed under a movable zone. At present it just checks for gigantic pages
but going forward it can incorporate other enhanced checks.
Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysctl_extfrag_handler() neglects to propagate the return value from
proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need
to exist, so just use proc_dointvec_minmax() directly.
Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Aditya Pakki <pakki001@umn.edu>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export __vmaloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is
enabled. Some test cases in vmalloc test suite module require and make
use of that function. Please note, that it is not supposed to be used
for other purposes.
We need it only for performance analysis, stressing and stability check
of vmalloc allocator.
Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmalloc_user*() calls differ from normal vmalloc() only in that they set
VM_USERMAP flags for the area. During the whole history of vmalloc.c
changes now it is possible simply to pass VM_USERMAP flags directly to
__vmalloc_node_range() call instead of finding the area (which obviously
takes time) after the allocation.
Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__vmalloc_area_node() calls vfree() on error path, which in turn calls
kmemleak_free(), but area is not yet accounted by kmemleak_vmalloc().
Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When VM_NO_GUARD is not set area->size includes adjacent guard page,
thus for correct size checking get_vm_area_size() should be used, but
not area->size.
This fixes possible kernel oops when userspace tries to mmap an area on
1 page bigger than was allocated by vmalloc_user() call: the size check
inside remap_vmalloc_range_partial() accounts non-existing guard page
also, so check successfully passes but vmalloc_to_page() returns NULL
(guard page does not physically exist).
The following code pattern example should trigger an oops:
static int oops_mmap(struct file *file, struct vm_area_struct *vma)
{
void *mem;
mem = vmalloc_user(4096);
BUG_ON(!mem);
/* Do not care about mem leak */
return remap_vmalloc_range(vma, mem, 0);
}
And userspace simply mmaps size + PAGE_SIZE:
mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);
Possible candidates for oops which do not have any explicit size
checks:
*** drivers/media/usb/stkwebcam/stk-webcam.c:
v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
Or the following one:
*** drivers/video/fbdev/core/fbmem.c
static int
fb_mmap(struct file *file, struct vm_area_struct * vma)
...
res = fb->fb_mmap(info, vma);
Where fb_mmap callback calls remap_vmalloc_range() directly without any
explicit checks:
*** drivers/video/fbdev/vfb.c
static int vfb_mmap(struct fb_info *info,
struct vm_area_struct *vma)
{
return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
}
Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch repeats the original one from David S Miller:
2dca6999ee ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA")
but for missed vmalloc_32_user() case, which also requires correct
alignment of virtual address on kernel side to avoid D-caches aliases.
A bit of copy-paste from original patch to recover in memory of what is
all about:
When a vmalloc'd area is mmap'd into userspace, some kind of
co-ordination is necessary for this to work on platforms with cpu
D-caches which can have aliases.
Otherwise kernel side writes won't be seen properly in userspace and
vice versa.
If the kernel side mapping and the user side one have the same
alignment, modulo SHMLBA, this can work as long as VM_SHARED is shared
of VMA and for all current users this is true. VM_SHARED will force
SHMLBA alignment of the user side mmap on platforms with D-cache
aliasing matters.
David S. Miller
> What are the user-visible runtime effects of this change?
In simple words: proper alignment avoids possible difference in data,
seen by different virtual mapings: userspace and kernel in our case.
I.e. userspace reads cache line A, kernel writes to cache line B. Both
cache lines correspond to the same physical memory (thus aliases).
So this should fix data corruption for archs with vivt and vipt caches,
e.g. armv6. Personally I've never worked with this archs, I just
spotted the strange difference in code: for one case we do alignment,
for another - not. I have a strong feeling that David simply missed
vmalloc_32_user() case.
>
> Is a -stable backport needed?
No, I do not think so. The only one user of vmalloc_32_user() is
virtual frame buffer device drivers/video/fbdev/vfb.c, which has in the
description "The main use of this frame buffer device is testing and
debugging the frame buffer subsystem. Do NOT enable it for normal
systems!".
And it seems to me that this vfb.c does not need 32bit addressable pages
(vmalloc_32_user() case), because it is virtual device and should not
care about things like dma32 zones, etc. Probably is better to clean
the code and switch vfb.c from vmalloc_32_user() to vmalloc_user() case
and wipe out vmalloc_32_user() from vmalloc.c completely. But I'm not
very much sure that this is worth to do, that's so minor, so we can
leave it as is.
Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
functions, so, the users don't have to explicitly check that condition.
This is purely code cleanup patch without any functional change. Only
the order of checks in memcg_charge_slab() can potentially be changed
but the functionally it will be same. This should not matter as
memcg_charge_slab() is not in the hot path.
Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two cases when put_cpu_partial() is invoked.
* __slab_free
* get_partial_node
This patch just makes it cover these two cases.
Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages. If there is a write fault in a page,
which is mapped to an only pte, and it is not related to swap cache; the
page may be reused without copying its content.
[ Note that we do not consider PageSwapCache() pages at least for now,
since we don't want to complicate __get_ksm_page(), which has nice
optimization based on this (for the migration case). Currenly it is
spinning on PageSwapCache() pages, waiting for when they have
unfreezed counters (i.e., for the migration finish). But we don't want
to make it also spinning on swap cache pages, which we try to reuse,
since there is not a very high probability to reuse them. So, for now
we do not consider PageSwapCache() pages at all. ]
So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page, which KSM is currently trying to
link to stable tree. Then we do page_ref_freeze() to prohibit KSM to
merge one more page into the page, we are reusing. After that, nobody
can refer to the reusing page: KSM skips !PageSwapCache() pages with
zero refcount; and the protection against of all other participants is
the same as for reused ordinary anon pages pte lock, page lock and
mmap_sem.
[akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.
1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"
This patch (of 2):
At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_vmap_area() can return a NULL pointer and we're going to
dereference it without checking it first. Use the existing
find_vm_area() function which does exactly what we want and checks for
the NULL pointer.
Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
Fixes: f3c01d2f3a ("mm: vmalloc: avoid racy handling of debugobjects in vunmap")
Signed-off-by: Liviu Dudau <liviu@dudau.co.uk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When freeing pages are done with higher order, time spent on coalescing
pages by buddy allocator can be reduced. With section size of 256MB,
hot add latency of a single section shows improvement from 50-60 ms to
less than 1 ms, hence improving the hot add latency by 60 times. Modify
external providers of online callback to align with the change.
[arunks@codeaurora.org: v11]
Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
[akpm@linux-foundation.org: remove unused local, per Arun]
[akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
[akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
[arunks@codeaurora.org: v8]
Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
[arunks@codeaurora.org: v9]
Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"addr" function argument is not used in alloc_consistency_checks() at
all, so remove it.
Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
Fixes: becfda68ab ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmemleak throws endless warnings during boot due to in
__alloc_alien_cache(),
alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
kmemleak_no_scan(ac);
Kmemleak does not track the array cache (alc->ac) but the alien cache
(alc) instead, so let it track the latter by lifting kmemleak_no_scan()
out of init_arraycache().
There is another place that calls init_arraycache(), but
alloc_kmem_cache_cpus() uses the percpu allocation where will never be
considered as a leak.
kmemleak: Found object by alias at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
lookup_object+0x84/0xac
find_and_get_object+0x84/0xe4
kmemleak_no_scan+0x74/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
kmemleak: Object 0xffff8007b9aa7e00 (size 256):
kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
kmemleak_alloc+0x84/0xb8
kmem_cache_alloc_node_trace+0x31c/0x3a0
__kmalloc_node+0x58/0x78
setup_kmem_cache_node+0x26c/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
kmemleak_no_scan+0x90/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
Fixes: 1fe00d50a9 ("slab: factor out initialization of array cache")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
new_slab_objects() will return immediately if freelist is not NULL.
if (freelist)
return freelist;
One more assignment operation could be avoided.
Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn
Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kasan_p4d_table(), kasan_pmd_table() and kasan_pud_table() are declared
as returning bool, but return 0 instead of false, which produces a
coccinelle warning. Fix it.
Link: http://lkml.kernel.org/r/1fa6fadf644859e8a6a8ecce258444b49be8c7ee.1551716733.git.andreyknvl@google.com
Fixes: 0207df4fa1 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Building little-endian allmodconfig kernels on arm64 started failing
with the generated atomic.h implementation, since we now try to call
kasan helpers from the EFI stub:
aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set':
include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'
I suspect that we get similar problems in other files that explicitly
disable KASAN for some reason but call atomic_t based helper functions.
We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro
that the compiler sets instead of checking CONFIG_KASAN, but this in
turn requires a small hack in mm/kasan/common.c so we do see the extern
declaration there instead of the inline function.
Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de
Fixes: b1864b828644 ("locking/atomics: build atomic headers as required")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
It triggers false positives in the allocation path:
BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
Read of size 8 at addr ffff88881f800000 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
__asan_report_load8_noabort+0x19/0x20
memchr_inv+0x2ea/0x330
kernel_poison_pages+0x103/0x3d5
get_page_from_freelist+0x15e7/0x4d90
because KASAN has not yet unpoisoned the shadow page for allocation
before it checks memchr_inv() but only found a stale poison pattern.
Also, false positives in free path,
BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
check_memory_region+0x22d/0x250
memset+0x28/0x40
kernel_poison_pages+0x29e/0x3d5
__free_pages_ok+0x75f/0x13e0
due to KASAN adds poisoned redzones around slab objects, but the page
poisoning needs to poison the whole page.
Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use after scope bugs detector seems to be almost entirely useless for
the linux kernel. It exists over two years, but I've seen only one
valid bug so far [1]. And the bug was fixed before it has been
reported. There were some other use-after-scope reports, but they were
false-positives due to different reasons like incompatibility with
structleak plugin.
This feature significantly increases stack usage, especially with GCC <
9 version, and causes a 32K stack overflow. It probably adds
performance penalty too.
Given all that, let's remove use-after-scope detector entirely.
While preparing this patch I've noticed that we mistakenly enable
use-after-scope detection for clang compiler regardless of
CONFIG_KASAN_EXTRA setting. This is also fixed now.
[1] http://lkml.kernel.org/r/<20171129052106.rhgbjhhis53hkgfn@wfg-t540p.sh.intel.com>
Link: http://lkml.kernel.org/r/20190111185842.13978-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Will Deacon <will.deacon@arm.com> [arm64]
Cc: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When soft_offline_in_use_page() runs on a thp tail page after pmd is
split, we trigger the following VM_BUG_ON_PAGE():
Memory failure: 0x3755ff: non anonymous thp
__get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
flags: 0x2fffff80000000()
raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:519!
soft_offline_in_use_page() passed refcount and page lock from tail page
to head page, which is not needed because we can pass any subpage to
split_huge_page().
Naoya had fixed a similar issue in c3901e722b ("mm: hwpoison: fix thp
split handling in memory_failure()"). But he missed fixing soft
offline.
Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
Fixes: 61f5d698cc ("mm: re-enable THP")
Signed-off-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"2 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hugetlbfs: fix races and page leaks during migration
kasan: turn off asan-stack for clang-8 and earlier
hugetlb pages should only be migrated if they are 'active'. The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It
then migrates the pages associated with the file from one node to
another. When the program exits, huge page counts are as follows:
node0
1024 free_hugepages
1024 nr_hugepages
node1
0 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:
node0
1024 free_hugepages
1024 nr_hugepages
node1
1024 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem
If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.
To fix that, check for this condition before migrating a huge page. If
the condition is detected, return EBUSY for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
security_mmap_addr() does a capability check with current_cred(), but
we can reach this code from contexts like a VFS write handler where
current_cred() must not be used.
This can be abused on systems without SMAP to make NULL pointer
dereferences exploitable again.
Fixes: 8869477a49 ("security: protect from stack expansion into low vm addresses")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we made the shmem_reserve_inode call in shmem_link conditional, we
forgot to update the declaration for ret so that it always has a known
value. Dan Carpenter pointed out this deficiency in the original patch.
Fixes: 1062af920c ("tmpfs: fix link accounting when a tmpfile is linked in")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Matej Kupljen <matej.kupljen@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 9da3f2b740.
It was well-intentioned, but wrong. Overriding the exception tables for
instructions for random reasons is just wrong, and that is what the new
code did.
It caused problems for tracing, and it caused problems for strncpy_from_user(),
because the new checks made perfectly valid use cases break, rather than
catch things that did bad things.
Unchecked user space accesses are a problem, but that's not a reason to
add invalid checks that then people have to work around with silly flags
(in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
odd way to say "this commit was wrong" and was sprinked into random
places to hide the wrongness).
The real fix to unchecked user space accesses is to get rid of the
special "let's not check __get_user() and __put_user() at all" logic.
Make __{get|put}_user() be just aliases to the regular {get|put}_user()
functions, and make it impossible to access user space without having
the proper checks in places.
The raison d'être of the special double-underscore versions used to be
that the range check was expensive, and if you did multiple user
accesses, you'd do the range check up front (like the signal frame
handling code, for example). But SMAP (on x86) and PAN (on ARM) have
made that optimization pointless, because the _real_ expense is the "set
CPU flag to allow user space access".
Do let's not break the valid cases to catch invalid cases that shouldn't
even exist.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rong Chen has reported the following boot crash:
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:page_mapping+0x12/0x80
Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
Call Trace:
__dump_page+0x14/0x2c0
is_mem_section_removable+0x24c/0x2c0
removable_show+0x87/0xa0
dev_attr_show+0x25/0x60
sysfs_kf_seq_show+0xba/0x110
seq_read+0x196/0x3f0
__vfs_read+0x34/0x180
vfs_read+0xa0/0x150
ksys_read+0x44/0xb0
do_syscall_64+0x5e/0x4a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
and bisected it down to commit efad4e475c ("mm, memory_hotplug:
is_mem_section_removable do not pass the end of a zone").
The reason for the crash is that the mapping is garbage for poisoned
(uninitialized) page. This shouldn't happen as all pages in the zone's
boundary should be initialized.
Later debugging revealed that the actual problem is an off-by-one when
evaluating the end_page. 'start_pfn + nr_pages' resp 'zone_end_pfn'
refers to a pfn after the range and as such it might belong to a
differen memory section.
This along with CONFIG_SPARSEMEM then makes the loop condition
completely bogus because a pointer arithmetic doesn't work for pages
from two different sections in that memory model.
Fix the issue by reworking is_pageblock_removable to be pfn based and
only use struct page where necessary. This makes the code slightly
easier to follow and we will remove the problematic pointer arithmetic
completely.
Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
Fixes: efad4e475c ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: <rong.a.chen@intel.com>
Tested-by: <rong.a.chen@intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memdump_user usually gets fed unchecked userspace input. Blasting a
full backtrace into dmesg every time is a bit excessive - I'm not sure
on the kernel rule in general, but at least in drm we're trying not to
let unpriviledge userspace spam the logs freely. Definitely not entire
warning backtraces.
It also means more filtering for our CI, because our testsuite exercises
these corner cases and so hits these a lot.
Link: http://lkml.kernel.org/r/20190220204058.11676-1-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>