OpenCloudOS-Kernel

Johannes Weiner 95f9ab2d59 mm: workingset: don't drop refault information prematurely

Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

Overview

PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources.

This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM.

Real-world applications

We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories.

One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is always some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets.

We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd

Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel.

We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling.

PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging.

How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

    some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average.

The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware).

What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action.

The memory file contains two lines:

    some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
    full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which nobody is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway.

The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition.

FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is.

1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run.

   PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup.

2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A lot can happen on a CPU in 5 seconds. This may be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization.

   PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events.

3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler.

   PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall.

Q: What's the cost / performance impact of this feature?

A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps.

I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions.

In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n.

For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test.

For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines.

During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%.

UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control.

UPDATE: v4 is on par with v3.

As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either.

The profile for the pipe benchmark shows:

    0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
    0.83% perf.real [kernel.vmlinux] [k] psi_group_change
    0.82% perf.real [kernel.vmlinux] [k] psi_task_change
    0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change

The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%.

For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test.

Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout.

Daniel Drake said:

: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.

Suren said:

: Backported to 4.9 and retested on ARMv8 8 code system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate an overflow/underflow issues. Nicely
: done!

This patch (of 9):

If we keep just enough refault information to match the current page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure.

[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2018-10-26 16:26:32 -07:00
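
The pressure file format described in the commit message above (a "some" line for cpu, "some" and "full" lines for memory and io) is simple enough to read from user space. The following is a minimal sketch in Python, not part of the patch series. The helper names parse_pressure and stall_percent are hypothetical, chosen only for illustration. The sample text is copied from the memory example in the commit message, so nothing here reads a real /proc/pressure file. The second helper illustrates the custom time averaging that the message says the total= counter allows, assuming two samples of total= taken a known number of seconds apart.

```python
def parse_pressure(text):
    """Parse the contents of a PSI pressure file into a nested dict.

    Returns e.g. {"some": {"avg10": 2.04, ..., "total": 157656722}}.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # total= is cumulative stall time in microseconds (an integer);
            # the avgN= fields are percentages of wall-clock time.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def stall_percent(total_before_us, total_after_us, interval_s):
    """Percentage of a custom window spent stalled, from two total= samples."""
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (interval_s * 1_000_000)


# Sample copied from the memory file example in the commit message.
SAMPLE = """\
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
"""

if __name__ == "__main__":
    memory = parse_pressure(SAMPLE)
    print(memory["full"]["avg10"])
    # Hypothetical samples: 0.5s of stall accumulated over a 2s window.
    print(stall_percent(1_000_000, 1_500_000, 2))
```

On a kernel built with CONFIG_PSI=y, the same parser could presumably be fed the contents of /proc/pressure/memory or a cgroup's memory.pressure file instead of the sample string.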
..
kasan	kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN	2018-08-17 16:20:30 -07:00
Kconfig	mm: disable deferred struct page for 32-bit arches	2018-09-20 22:01:11 +02:00
Kconfig.debug	mm: clarify CONFIG_PAGE_POISONING and usage	2018-08-22 10:52:44 -07:00
Makefile	arm64 updates for 4.20:	2018-10-22 17:30:06 +01:00
backing-dev.c	blkcg: delay blkg destruction until after writeback has finished	2018-08-31 14:48:56 -06:00
balloon_compaction.c	virtio_balloon: fix deadlock on OOM	2017-11-14 23:57:38 +02:00
bootmem.c	docs/mm: bootmem: add overview documentation	2018-08-02 12:17:27 -06:00
cleancache.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
cma.c	mm/cma: remove unsupported gfp_mask parameter from cma_alloc()	2018-08-17 16:20:32 -07:00
cma.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
cma_debug.c	mm/cma: remove unsupported gfp_mask parameter from cma_alloc()	2018-08-17 16:20:32 -07:00
compaction.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
debug.c	mm: get rid of vmacache_flush_all() entirely	2018-09-13 15:18:04 -10:00
debug_page_ref.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
dmapool.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
early_ioremap.c	mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep	2017-12-11 14:54:44 +01:00
fadvise.c	vfs: implement readahead(2) using POSIX_FADV_WILLNEED	2018-08-30 20:01:32 +02:00
failslab.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
filemap.c	mm: convert to use vm_fault_t	2018-10-26 16:25:19 -07:00
frame_vector.c	mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'	2017-12-14 16:00:48 -08:00
frontswap.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
gup.c	mm: Change return type int to vm_fault_t for fault handlers	2018-08-23 18:48:44 -07:00
gup_benchmark.c	mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl	2018-10-05 16:32:04 -07:00
highmem.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
hmm.c	libnvdimm-for-4.19_dax-memory-failure	2018-08-25 18:43:59 -07:00
huge_memory.c	mremap: properly flush TLB before releasing the page	2018-10-18 11:30:52 +02:00
hugetlb.c	hugetlb: take PMD sharing into account when flushing tlb/caches	2018-10-05 16:32:04 -07:00
hugetlb_cgroup.c	mm: rename page_counter's count/limit into usage/max	2018-06-07 17:34:35 -07:00
hwpoison-inject.c	mm/memory_failure: Remove unused trapno from memory_failure	2018-01-23 12:17:42 -06:00
init-mm.c	mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids	2018-07-17 09:35:30 +02:00
internal.h	mm: Change return type int to vm_fault_t for fault handlers	2018-08-23 18:48:44 -07:00
interval_tree.c	mm/interval_tree.c: use vma_pages() helper	2018-01-31 17:18:37 -08:00
khugepaged.c	mm: Change return type int to vm_fault_t for fault handlers	2018-08-23 18:48:44 -07:00
kmemleak-test.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
kmemleak.c	kmemleak: add module param to print warnings to dmesg	2018-10-26 16:25:19 -07:00
ksm.c	include/linux/compiler.h: make compiler-.h mutually exclusive	2018-08-22 17:31:34 -07:00
list_lru.c	mm/list_lru: introduce list_lru_shrink_walk_irq()	2018-08-17 16:20:32 -07:00
maccess.c	x86/fault: BUG() when uaccess helpers fault on kernel addresses	2018-09-03 15:12:09 +02:00
madvise.c	mm: madvise(MADV_DODUMP): allow hugetlbfs pages	2018-10-05 16:32:05 -07:00
memblock.c	mm/memblock.c: replace u64 with phys_addr_t where appropriate	2018-08-17 16:20:30 -07:00
memcontrol.c	mm: drain memcg stocks on css offlining	2018-10-26 16:25:19 -07:00
memfd.c	alloc_file(): switch to passing O_... flags instead of FMODE_... mode	2018-07-12 10:02:57 -04:00
memory-failure.c	libnvdimm-for-4.19_dax-memory-failure	2018-08-25 18:43:59 -07:00
memory.c	mm: convert insert_pfn() to vm_fault_t	2018-10-26 16:25:20 -07:00
memory_hotplug.c	mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.	2018-09-04 16:45:02 -07:00
mempolicy.c	userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults	2018-10-26 16:25:20 -07:00
mempool.c	mm/mempool.c: add missing parameter description	2018-08-22 10:52:44 -07:00
memtest.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
migrate.c	Merge branch 'akpm'	2018-10-05 16:33:03 -07:00
mincore.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
mlock.c	dax: remove VM_MIXEDMAP for fsdax and device dax	2018-08-17 16:20:27 -07:00
mm_init.c	mm: access zone->node via zone_to_nid() and zone_set_nid()	2018-08-22 10:52:45 -07:00
mmap.c	mm/mmap.c: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE	2018-10-13 09:31:02 +02:00
mmu_context.c	sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h>	2017-03-02 08:42:38 +01:00
mmu_gather.c	mm/memory: Move mmu_gather and TLB invalidation code into its own file	2018-09-07 15:19:25 +01:00
mmu_notifier.c	Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks"	2018-10-26 16:25:19 -07:00
mmzone.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
mprotect.c	x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings	2018-06-20 19:10:01 +02:00
mremap.c	mremap: properly flush TLB before releasing the page	2018-10-18 11:30:52 +02:00
msync.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
nobootmem.c	mm/memblock: add a name for memblock flags enumeration	2018-08-02 12:17:27 -06:00
nommu.c	mm: provide a fallback for PAGE_KERNEL_EXEC for architectures	2018-08-17 16:20:29 -07:00
oom_kill.c	Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2018-10-24 11:22:39 +01:00
page-writeback.c	notifier: Remove notifier header file wherever not used	2018-08-30 12:56:40 +02:00
page_alloc.c	mm: rename and change semantics of nr_indirectly_reclaimable_bytes	2018-10-26 16:26:32 -07:00
page_counter.c	memcg: introduce memory.min	2018-06-07 17:34:36 -07:00
page_ext.c	mm/page_ext.c: constify lookup_page_ext() argument	2018-08-17 16:20:28 -07:00
page_idle.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
page_io.c	blkcg: associate a blkg for pages being evicted by swap	2018-09-21 20:29:09 -06:00
page_isolation.c	mm, migrate: remove reason argument from new_page_t	2018-04-11 10:28:32 -07:00
page_owner.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
page_poison.c	mm/page_poison.c: make early_page_poison_param() __init	2018-04-05 21:36:26 -07:00
page_vma_mapped.c	mm, page_vma_mapped: Introduce pfn_in_hpage()	2018-01-22 12:15:57 -08:00
pagewalk.c	mm: kernel-doc: add missing parameter descriptions	2018-04-05 21:36:27 -07:00
percpu-internal.h	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
percpu-km.c	percpu: allow select gfp to be passed to underlying allocators	2018-02-18 05:33:01 -08:00
percpu-stats.c	treewide: Use array_size() in vmalloc()	2018-06-12 16:19:22 -07:00
percpu-vm.c	percpu: allow select gfp to be passed to underlying allocators	2018-02-18 05:33:01 -08:00
percpu.c	percpu: stop leaking bitmap metadata blocks	2018-10-07 14:50:12 -07:00
pgtable-generic.c	x86/mm: Page size aware flush_tlb_mm_range()	2018-10-09 16:51:11 +02:00
process_vm_access.c	mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors	2018-02-06 18:32:48 -08:00
quicklist.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
readahead.c	vfs: implement readahead(2) using POSIX_FADV_WILLNEED	2018-08-30 20:01:32 +02:00
rmap.c	mm: migration: fix migration of huge PMD shared pages	2018-10-05 16:32:04 -07:00
rodata_test.c	mm: fix RODATA_TEST failure "rodata_test: test data was not read only"	2017-10-03 17:54:24 -07:00
shmem.c	mm: shmem.c: Correctly annotate new inodes for lockdep	2018-09-20 22:01:11 +02:00
slab.c	mm, slab: combine kmalloc_caches and kmalloc_dma_caches	2018-10-26 16:26:31 -07:00
slab.h	mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB	2018-08-17 16:20:30 -07:00
slab_common.c	mm, slab: shorten kmalloc cache names for large sizes	2018-10-26 16:26:32 -07:00
slob.c	slab: __GFP_ZERO is incompatible with a constructor	2018-06-07 17:34:34 -07:00
slub.c	mm, slab: combine kmalloc_caches and kmalloc_dma_caches	2018-10-26 16:26:31 -07:00
sparse-vmemmap.c	mm/sparse: delete old sparse_init and enable new one	2018-08-17 16:20:32 -07:00
sparse.c	mm/sparse: delete old sparse_init and enable new one	2018-08-17 16:20:32 -07:00
swap.c	mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS	2018-05-22 06:59:39 -07:00
swap_cgroup.c	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	2017-11-02 11:10:55 +01:00
swap_slots.c	mm, swap, get_swap_pages: use entry_size instead of cluster in parameter	2018-08-22 10:52:44 -07:00
swap_state.c	treewide: kvzalloc() -> kvcalloc()	2018-06-12 16:19:22 -07:00
swapfile.c	mm/swapfile.c: clear si->swap_map[] in swap_free_cluster()	2018-10-26 16:25:19 -07:00
truncate.c	page cache: use xa_lock	2018-04-11 10:28:39 -07:00
usercopy.c	usercopy: Allow boot cmdline disabling of hardening	2018-07-04 08:04:52 -07:00
userfaultfd.c	userfaultfd: prevent non-cooperative events vs mcopy_atomic races	2018-06-07 17:34:38 -07:00
util.c	mm: rename and change semantics of nr_indirectly_reclaimable_bytes	2018-10-26 16:26:32 -07:00
vmacache.c	mm: get rid of vmacache_flush_all() entirely	2018-09-13 15:18:04 -10:00
vmalloc.c	mm: provide a fallback for PAGE_KERNEL_EXEC for architectures	2018-08-17 16:20:29 -07:00
vmpressure.c	mm/vmpressure.c: convert to use match_string() helper	2018-06-07 17:34:36 -07:00
vmscan.c	mm: don't miss the last page because of round-off error	2018-10-26 16:25:19 -07:00
vmstat.c	mm: rename and change semantics of nr_indirectly_reclaimable_bytes	2018-10-26 16:26:32 -07:00
workingset.c	mm: workingset: don't drop refault information prematurely	2018-10-26 16:26:32 -07:00
z3fold.c	z3fold: fix reclaim lock-ups	2018-05-11 17:28:45 -07:00
zbud.c	mm: docs: fix parameter names mismatch	2018-02-06 18:32:48 -08:00
zpool.c	mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc	2018-02-21 15:35:43 -08:00
zsmalloc.c	mm/zsmalloc.c: make several functions and a struct static	2018-08-17 16:20:30 -07:00
zswap.c	zswap: re-check zswap_is_full() after do zswap_shrink()	2018-07-26 19:38:03 -07:00