linux-sg2042

Michal Hocko b6459cc154 vmscan: consider classzone_idx in compaction_ready	2016-05-20 17:58:30 -07:00

Motivation:

As pointed out by Linus [2][3], relying on zone_reclaimable as a way to communicate the reclaim progress is rather dubious. I tend to agree: not only is it really obscure, it is not hard to imagine cases where a single page freed in the loop keeps all the reclaimers looping without making any progress because their gfp_mask wouldn't allow them to get that page anyway (e.g. a single GFP_ATOMIC alloc and free loop). This is rather rare so it doesn't happen in practice, but the current logic is rather obscure, hard to follow and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and easier to follow because each reclaimer basically tracks its own progress, which is implemented at the page allocator layer rather than spread out between the allocator and the reclaim. More on the implementation is described in the first patch.

I have tested several different scenarios but it should be clear that testing the OOM killer is quite hard to make representative. There is usually a tiny gap between almost OOM and full blown OOM which is often time sensitive. Anyway, I have tested the following 2 scenarios and I would appreciate if there are more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs without any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G file size, removes the files and starts over again) running in parallel for 10s to build up a lot of dirty pages when 100 parallel mem_eaters (anon private populated mmap which waits until it gets signal) with 80M each.

This causes an OOM flood of course and I have compared both patched and unpatched kernels. The test is considered finished after there are no OOM conditions detected. This should tell us whether there are any excessive kills or some of them premature (e.g. due to dirty pages):

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78

$ grep "Kill process" base-oom-run1.log | tail -n1
[ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
$ grep "Out of memory:" patched-oom-run2.log | wc -l
77

$ grep "Kill process" patched-oom-run1.log | tail -n1
[ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel ran noticeably longer while invoking the OOM killer the same number of times. This means that the original implementation is much more aggressive and triggers the OOM killer sooner. Free pages stats show that neither kernel went OOM too early most of the time, though. I guess the difference is in the backoff, when retries without any progress sleep for a while if there is memory under writeback or dirty, which is highly likely considering the parallel IO. Both kernels have seen races where the zone wasn't marked unreclaimable and we still hit the OOM killer. This is most likely a race where a task managed to exit between the last allocation attempt and the oom killer invocation.

2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much memory as possible without triggering the OOM killer. This required a lot of tuning but I've considered 3 consecutive runs in three different boots without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

That means 40M more memory was usable without triggering the OOM killer. The base kernel sometimes managed to handle the same as patched but it wasn't consistent and failed in at least one of the 3 runs. This seems like a minor improvement.

I was also testing __GFP_REPEAT costly requests (hugetlb) with fragmented memory and under memory pressure. The results are in patch 11 where the logic is implemented. In short I can see a huge improvement there.

I am certainly interested in other usecases as well as any feedback. Especially those which require higher order requests.

This patch (of 14):

While playing with the oom detection rework [1] I have noticed that my heavy order-9 (hugetlb) load close to OOM ended up in an endless loop where the reclaim hasn't made any progress but did_some_progress didn't reflect that and compaction_suitable was backing off because no zone is above low wmark + 1 << order.

It turned out that this is in fact a long-standing bug in compaction_ready which ignores the requested_highidx and did the watermark check for 0 classzone_idx. This succeeds for zone DMA most of the time as the zone is mostly unused because of lowmem protection. As a result costly high order allocations always report successful progress even when there was none. This wasn't a problem so far because these allocations usually fail quite early or retry only a few times with __GFP_REPEAT, but this will change after a later patch in this series, so make sure to not lie about the progress and propagate requested_highidx down to compaction_ready and use it for both the watermark check and compaction_suitable to fix this issue.

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
[2] https://lkml.org/lkml/2015/10/12/808
[3] https://lkml.org/lkml/2015/10/13/597

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
..
kasan	mm, kasan: fix compilation for CONFIG_SLAB	2016-04-01 17:03:37 -05:00
Kconfig	memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE	2016-05-19 19:12:14 -07:00
Kconfig.debug	mm/page_ref: add tracepoint to track down page reference manipulation	2016-03-17 15:09:34 -07:00
Makefile	mm, kasan: SLAB support	2016-03-25 16:37:42 -07:00
backing-dev.c	writeback: fix the wrong congested state variable definition	2016-03-31 12:26:25 -06:00
balloon_compaction.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial	2016-03-17 21:38:27 -07:00
bootmem.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
cleancache.c	cleancache: constify cleancache_ops structure	2016-01-27 09:09:57 -05:00
cma.c	mm/cma.c: suppress warning	2015-11-05 19:34:48 -08:00
cma.h	mm: cma: mark cma_bitmap_maxno() inline in header	2015-08-14 15:56:32 -07:00
cma_debug.c	mm/cma_debug: correct size input to bitmap function	2015-07-17 16:39:54 -07:00
compaction.c	mm, page_alloc: remove field from alloc_context	2016-05-19 19:12:14 -07:00
debug.c	mm: introduce page reference manipulation functions	2016-03-17 15:09:34 -07:00
debug_page_ref.c	mm/page_ref: add tracepoint to track down page reference manipulation	2016-03-17 15:09:34 -07:00
dmapool.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
early_ioremap.c	mm/early_ioremap: use offset_in_page macro	2015-11-05 19:34:48 -08:00
fadvise.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
failslab.c	mm: fault-inject take over bootstrap kmem_cache check	2016-03-15 16:55:16 -07:00
filemap.c	mm: filemap: only do access activations on reads	2016-05-20 17:58:30 -07:00
frame_vector.c	mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm	2016-02-16 10:11:12 +01:00
frontswap.c	frontswap: allow multiple backends	2015-06-24 17:49:45 -07:00
gup.c	Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-04-14 19:31:34 -07:00
highmem.c	mm/highmem: make nr_free_highpages() handles all highmem zones by itself	2016-05-19 19:12:14 -07:00
huge_memory.c	huge mm: move_huge_pmd does not need new_vma	2016-05-19 19:12:14 -07:00
hugetlb.c	mm/hugetlb: add same zone check in pfn_range_valid_gigantic()	2016-05-19 19:12:14 -07:00
hugetlb_cgroup.c	mm: make compound_head() robust	2015-11-06 17:50:42 -08:00
hwpoison-inject.c	hwpoison: use page_cgroup_ino for filtering by memcg	2015-09-10 13:29:01 -07:00
init-mm.c	…
internal.h	mm, page_alloc: remove field from alloc_context	2016-05-19 19:12:14 -07:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
kmemcheck.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
kmemleak-test.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
kmemleak.c	mm: coalesce split strings	2016-03-17 15:09:34 -07:00
ksm.c	ksm: fix conflict between mmput and scan_get_next_rmap_item	2016-05-12 15:52:50 -07:00
list_lru.c	mm: memcontrol: move kmem accounting code to CONFIG_MEMCG	2016-01-20 17:09:18 -08:00
maccess.c	mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe	2015-11-05 19:34:48 -08:00
madvise.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
memblock.c	mm: coalesce split strings	2016-03-17 15:09:34 -07:00
memcontrol.c	oom, oom_reaper: try to reap tasks which skip regular OOM killer path	2016-05-19 19:12:14 -07:00
memory-failure.c	mm/memory-failure: fix race with compound page split/merge	2016-04-28 19:34:04 -07:00
memory.c	mm: thp: calculate the mapcount correctly for THP pages during WP faults	2016-05-12 15:52:50 -07:00
memory_hotplug.c	memory_hotplug: introduce memhp_default_state= command line parameter	2016-05-19 19:12:14 -07:00
mempolicy.c	mm, page_alloc: avoid looking up the first zone in a zonelist twice	2016-05-19 19:12:14 -07:00
mempool.c	mm, kasan: add GFP flags to KASAN API	2016-03-25 16:37:42 -07:00
memtest.c	memtest: remove unused header files	2015-09-08 15:35:28 -07:00
migrate.c	mm: use __SetPageSwapBacked and dont ClearPageSwapBacked	2016-05-19 19:12:14 -07:00
mincore.c	mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage	2016-04-04 10:41:08 -07:00
mlock.c	mm: fix mlock accouting	2016-01-21 17:20:51 -08:00
mm_init.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
mmap.c	mm/mmap: kill hook arch_rebalance_pgtables()	2016-05-19 19:12:14 -07:00
mmu_context.c	mm/mmu_context, sched/core: Fix mmu_context.h assumption	2016-04-28 11:44:19 +02:00
mmu_notifier.c	fix Christoph's email addresses	2016-03-17 15:09:34 -07:00
mmzone.c	mm, page_alloc: inline the fast path of the zonelist iterator	2016-05-19 19:12:14 -07:00
mprotect.c	mm/mprotect.c: don't imply PROT_EXEC on non-exec fs	2016-03-22 15:36:02 -07:00
mremap.c	huge pagecache: extend mremap pmd rmap lockout to files	2016-05-19 19:12:14 -07:00
msync.c	mm/msync: use offset_in_page macro	2015-11-05 19:34:48 -08:00
nobootmem.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
nommu.c	Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-04-14 19:31:34 -07:00
oom_kill.c	mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper	2016-05-19 19:12:14 -07:00
page-writeback.c	mm/writeback: correct dirty page calculation for highmem	2016-05-19 19:12:14 -07:00
page_alloc.c	mm: vmscan: reduce size of inactive file list	2016-05-20 17:58:30 -07:00
page_counter.c	mm: page_counter: let page_counter_try_charge() return bool	2015-11-05 19:34:48 -08:00
page_ext.c	mm/page_poisoning.c: allow for zero poisoning	2016-03-15 16:55:16 -07:00
page_idle.c	mm: add page_check_address_transhuge() helper	2016-01-15 17:56:32 -08:00
page_io.c	Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2016-05-17 15:05:23 -07:00
page_isolation.c	mm/memory_hotplug: add comment to some functions related to memory hotplug	2016-05-19 19:12:14 -07:00
page_owner.c	mm, page_alloc: inline pageblock lookup in page free fast paths	2016-05-19 19:12:14 -07:00
page_poison.c	mm/page_poisoning.c: allow for zero poisoning	2016-03-15 16:55:16 -07:00
pagewalk.c	thp: rename split_huge_page_pmd() to split_huge_pmd()	2016-01-15 17:56:32 -08:00
percpu-km.c	mm: percpu: use pr_fmt to prefix output	2016-03-17 15:09:34 -07:00
percpu-vm.c	percpu: move region iterations out of pcpu_[de]populate_chunk()	2014-09-02 14:46:02 -04:00
percpu.c	mm: percpu: use pr_fmt to prefix output	2016-03-17 15:09:34 -07:00
pgtable-generic.c	mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range	2016-03-17 15:09:34 -07:00
process_vm_access.c	mm/gup: Introduce get_user_pages_remote()	2016-02-16 10:04:09 +01:00
quicklist.c	fix Christoph's email addresses	2016-03-17 15:09:34 -07:00
readahead.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
rmap.c	mm: use __SetPageSwapBacked and dont ClearPageSwapBacked	2016-05-19 19:12:14 -07:00
shmem.c	tmpfs: mem_cgroup charge fault to vm_mm not current mm	2016-05-19 19:12:14 -07:00
slab.c	include/linux/nodemask.h: create next_node_in() helper	2016-05-19 19:12:14 -07:00
slab.h	mm, kasan: add GFP flags to KASAN API	2016-03-25 16:37:42 -07:00
slab_common.c	mm, kasan: add GFP flags to KASAN API	2016-03-25 16:37:42 -07:00
slob.c	mm: slab: free kmem_cache_node after destroy sysfs file	2016-02-18 16:23:24 -08:00
slub.c	mm: rename _count, field of the struct page, to _refcount	2016-05-19 19:12:14 -07:00
sparse-vmemmap.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
sparse.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
swap.c	thp: keep huge zero page pinned until tlb flush	2016-04-28 19:34:04 -07:00
swap_cgroup.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
swap_state.c	mm: use __SetPageSwapBacked and dont ClearPageSwapBacked	2016-05-19 19:12:14 -07:00
swapfile.c	mm: thp: calculate the mapcount correctly for THP pages during WP faults	2016-05-12 15:52:50 -07:00
truncate.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
userfaultfd.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
util.c	mm: uninline page_mapped()	2016-05-19 19:12:14 -07:00
vmacache.c	mm/vmacache: inline vmacache_valid_mm()	2015-11-05 19:34:48 -08:00
vmalloc.c	mm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment	2016-03-17 15:09:34 -07:00
vmpressure.c	mm/vmpressure.c: fix subtree pressure detection	2016-02-03 08:28:43 -08:00
vmscan.c	vmscan: consider classzone_idx in compaction_ready	2016-05-20 17:58:30 -07:00
vmstat.c	mm, page_alloc: inline pageblock lookup in page free fast paths	2016-05-19 19:12:14 -07:00
workingset.c	mm: workingset: make shadow node shrinker memcg aware	2016-03-17 15:09:34 -07:00
zbud.c	mm/zbud.c: use list_last_entry() instead of list_tail_entry()	2016-01-15 11:40:52 -08:00
zpool.c	mm: zsmalloc: constify struct zs_pool name	2015-11-06 17:50:42 -08:00
zsmalloc.c	zsmalloc: fix zs_can_compact() integer overflow	2016-05-09 17:40:59 -07:00
zswap.c	mm/zswap: provide unique zpool name	2016-05-05 17:38:53 -07:00