OpenCloudOS-Kernel

History

Johannes Weiner 8a931f8013 mm: memcontrol: recursive memory.low protection

Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios.

Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc.

We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution.

Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree.

It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective.

To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subtree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules.

That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees.

E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings.

This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings.

The floating protection composes naturally with fixed protection. Consider the following example tree:

        A      A: low = 2G
       / \    A1: low = 1G
      A1 A2   A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G.

There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
		2020-04-02 09:35:28 -07:00
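
The arithmetic behind the A/A1/A2 example in the commit message above can be sketched in C. This is a simplified model written for illustration, not the code the patch adds: the function name effective_low(), its parameters, and the MiB unit are all assumptions, and so are the usage figures in the note below, which the commit message does not state. The sketch gives each child its claimed protection (its memory.low, capped by its usage) and then hands out the parent's unclaimed protection in proportion to each child's unprotected usage.

```c
#include <assert.h>

/* Hypothetical helper: smaller of two values. */
static unsigned long long min_ull(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

/*
 * Simplified, illustrative model of a child's effective memory.low.
 * All arguments share one unit (MiB is assumed here).
 *
 * usage              memory used by this child
 * low                this child's memory.low setting
 * parent_usage       memory used by the parent (sum over its children)
 * parent_effective   protection the parent can hand down
 * siblings_protected sum of min(usage, low) over all children,
 *                    including this one
 */
unsigned long long effective_low(unsigned long long usage,
				 unsigned long long low,
				 unsigned long long parent_usage,
				 unsigned long long parent_effective,
				 unsigned long long siblings_protected)
{
	/* A child can only claim protection for memory it actually uses. */
	unsigned long long claimed = min_ull(usage, low);
	unsigned long long unprotected_total;
	unsigned long long floating;

	/* Claimed protection can never exceed what the children use. */
	assert(siblings_protected <= parent_usage);

	/* Children claim more than the parent has: scale claims down. */
	if (siblings_protected > parent_effective)
		return claimed * parent_effective / siblings_protected;

	/* Nothing unprotected to share the floating remainder over. */
	unprotected_total = parent_usage - siblings_protected;
	if (usage <= claimed || unprotected_total == 0)
		return claimed;

	/* Distribute the unclaimed remainder by unprotected usage. */
	floating = parent_effective - siblings_protected;
	return claimed + floating * (usage - claimed) / unprotected_total;
}
```

Assuming A1 uses 2048 MiB and A2 uses 1024 MiB (so each has 1024 MiB of unprotected usage), effective_low(2048, 1024, 3072, 2048, 1024) yields 1536 for A1 and effective_low(1024, 0, 3072, 2048, 1024) yields 512 for A2, matching the 1.5G and 0.5G quoted in the commit message. With different usage figures the floating 1G would no longer split evenly.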
..
bpf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next	2020-03-30 19:52:37 -07:00
cgroup	mm: memcontrol: recursive memory.low protection	2020-04-02 09:35:28 -07:00
configs	…
debug	Revert "kdb: Get rid of confusing diag msg from "rd" if current task has no regs"	2020-02-06 11:40:09 +00:00
dma	dma-mapping: Fix dma_pgprot() for unencrypted coherent pages	2020-03-17 11:52:58 +01:00
events	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
gcov	Revert "um: Enable CONFIG_CONSTRUCTORS"	2020-01-19 22:42:06 +01:00
irq	x86 entry code updates:	2020-03-30 19:14:28 -07:00
livepatch	New tracing features:	2019-11-27 11:42:01 -08:00
locking	x86 entry code updates:	2020-03-30 19:14:28 -07:00
power	Merge branches 'pm-core', 'pm-sleep', 'pm-acpi' and 'pm-domains'	2020-03-30 14:46:58 +02:00
printk	console: Introduce ->exit() callback	2020-02-11 10:44:22 +01:00
rcu	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2020-03-30 16:17:15 -07:00
sched	CPU (hotplug) updates:	2020-03-30 18:06:39 -07:00
time	timekeeping and timer updates:	2020-03-30 18:51:47 -07:00
trace	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next	2020-03-30 19:52:37 -07:00
.gitignore	Provide in-kernel headers to make extending kernel easier	2019-04-29 16:48:03 +02:00
Kconfig.freezer	treewide: Add SPDX license identifier - Makefile/Kconfig	2019-05-21 10:50:46 +02:00
Kconfig.hz	treewide: Add SPDX license identifier - Makefile/Kconfig	2019-05-21 10:50:46 +02:00
Kconfig.locks	sched/rt, locking: Use CONFIG_PREEMPTION	2019-12-08 14:37:36 +01:00
Kconfig.preempt	sched/Kconfig: Fix spelling mistake in user-visible help text	2019-11-12 11:35:32 +01:00
Makefile	kcov: ignore fault-inject and stacktrace	2020-01-31 10:30:41 -08:00
acct.c	acct: stop using get_seconds()	2019-12-18 18:07:31 +01:00
async.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441	2019-06-05 17:37:17 +02:00
audit.c	audit/stable-5.7 PR 20200330	2020-03-31 15:04:17 -07:00
audit.h	audit: trigger accompanying records when no rules present	2020-03-12 10:42:51 -04:00
audit_fsnotify.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157	2019-05-30 11:26:37 -07:00
audit_tree.c	fsnotify: switch send_to_group() and ->handle_event to const struct qstr *	2019-04-26 13:51:03 -04:00
audit_watch.c	audit: CONFIG_CHANGE don't log internal bookkeeping as an event	2020-02-10 10:46:35 -05:00
auditfilter.c	audit: fix error handling in audit_data_to_entry()	2020-02-22 20:36:47 -05:00
auditsc.c	audit: trigger accompanying records when no rules present	2020-03-12 10:42:51 -04:00
backtracetest.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441	2019-06-05 17:37:17 +02:00
bounds.c	…
capability.c	…
compat.c	y2038: remove unused time32 interfaces	2020-02-21 11:22:15 -08:00
configs.c	proc: convert everything to "struct proc_ops"	2020-02-04 03:05:26 +00:00
context_tracking.c	context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ	2020-02-14 16:05:04 +01:00
cpu.c	CPU (hotplug) updates:	2020-03-30 18:06:39 -07:00
cpu_pm.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 282	2019-06-05 17:36:37 +02:00
crash_core.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230	2019-06-19 17:09:06 +02:00
crash_dump.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
cred.c	Merge branch 'dhowells' (patches from DavidH)	2020-01-14 09:56:31 -08:00
delayacct.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25	2019-05-21 11:52:39 +02:00
dma.c	…
elfcore.c	kernel/elfcore.c: include proper prototypes	2019-09-25 17:51:39 -07:00
exec_domain.c	…
exit.c	timekeeping and timer updates:	2020-03-30 18:51:47 -07:00
extable.c	bpf: Remove bpf_image tree	2020-03-13 12:49:52 -07:00
fail_function.c	fail_function: no need to check return value of debugfs_create functions	2019-06-03 15:49:06 +02:00
fork.c	mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()	2020-04-02 09:35:28 -07:00
freezer.c	Revert "libata, freezer: avoid block device removal while system is frozen"	2019-10-06 09:11:37 -06:00
futex.c	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2020-03-30 16:17:15 -07:00
gen_kheaders.sh	kheaders: explain why include/config/autoconf.h is excluded from md5sum	2019-11-11 20:10:01 +09:00
groups.c	…
hung_task.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
iomem.c	mm/nvdimm: add is_ioremap_addr and use that to check ioremap address	2019-07-12 11:05:40 -07:00
irq_work.c	lockdep: Annotate irq_work	2020-03-21 16:00:24 +01:00
jump_label.c	jump_label: Don't warn on __exit jump entries	2019-08-29 15:10:10 +01:00
kallsyms.c	Kbuild updates for v5.6 (2nd)	2020-02-09 16:05:50 -08:00
kcmp.c	…
kcov.c	kcov: remote coverage support	2019-12-04 19:44:14 -08:00
kexec.c	kexec: add machine_kexec_post_load()	2020-01-08 16:32:55 +00:00
kexec_core.c	kexec: add machine_kexec_post_load()	2020-01-08 16:32:55 +00:00
kexec_elf.c	kexec_elf: support 32 bit ELF files	2019-09-06 23:58:44 +02:00
kexec_file.c	kexec: add machine_kexec_post_load()	2020-01-08 16:32:55 +00:00
kexec_internal.h	kexec: add machine_kexec_post_load()	2020-01-08 16:32:55 +00:00
kheaders.c	kheaders: Move from proc to sysfs	2019-05-24 20:16:01 +02:00
kmod.c	…
kprobes.c	kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic	2020-01-09 12:40:13 +01:00
ksysfs.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 170	2019-05-30 11:26:39 -07:00
kthread.c	kthread: Do not preempt current task if it is going to call schedule()	2020-03-20 13:06:20 +01:00
latencytop.c	proc: convert everything to "struct proc_ops"	2020-02-04 03:05:26 +00:00
module-internal.h	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36	2019-05-24 17:27:11 +02:00
module.c	proc: convert everything to "struct proc_ops"	2020-02-04 03:05:26 +00:00
module_signature.c	MODSIGN: Export module signature definitions	2019-08-05 18:39:56 -04:00
module_signing.c	MODSIGN: Export module signature definitions	2019-08-05 18:39:56 -04:00
notifier.c	x86/mm: split vmalloc_sync_all()	2020-03-21 18:56:06 -07:00
nsproxy.c	ns: Introduce Time Namespace	2020-01-14 12:20:48 +01:00
padata.c	padata: update documentation	2019-12-11 16:37:02 +08:00
panic.c	locking/refcount: Remove unused 'refcount_error_report()' function	2019-11-25 09:15:42 +01:00
params.c	lockdown: Lock down module params that specify hardware parameters (eg. ioport)	2019-08-19 21:54:16 -07:00
pid.c	pid: make ENOMEM return value more obvious	2020-03-09 23:40:05 +01:00
pid_namespace.c	fork: extend clone3() to support setting a PID	2019-11-15 23:49:22 +01:00
profile.c	proc: convert everything to "struct proc_ops"	2020-02-04 03:05:26 +00:00
ptrace.c	ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()	2020-01-18 13:51:39 +01:00
range.c	…
reboot.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
relay.c	…
resource.c	mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range()	2019-09-24 15:54:09 -07:00
rseq.c	rseq: Reject unknown flags on rseq unregister	2019-12-25 10:41:20 +01:00
seccomp.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
signal.c	signal: avoid double atomic counter increments for user accounting	2020-02-26 09:54:03 -08:00
smp.c	cpu/hotplug: Move bringup of secondary CPUs out of smp_init()	2020-03-25 12:59:37 +01:00
smpboot.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
smpboot.h	…
softirq.c	lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}()	2020-03-21 16:03:54 +01:00
stackleak.c	…
stacktrace.c	stacktrace: Get rid of unneeded '!!' pattern	2019-11-11 10:30:59 +01:00
stop_machine.c	stop_machine: Make stop_cpus() static	2020-01-17 10:19:21 +01:00
sys.c	sys/sysinfo: Respect boottime inside time namespace	2020-03-03 19:34:32 +01:00
sys_ni.c	y2038: allow disabling time32 system calls	2019-11-15 14:38:30 +01:00
sysctl-test.c	kunit: allow kunit tests to be loaded as a module	2020-01-09 16:42:29 -07:00
sysctl.c	sysctl/sysrq: Remove __sysrq_enabled copy	2020-03-07 09:52:02 +01:00
sysctl_binary.c	sysctl: Remove the sysctl system call	2019-11-26 13:03:56 -06:00
task_work.c	task_work_run: don't take ->pi_lock unconditionally	2020-03-02 14:06:33 -07:00
taskstats.c	taskstats: fix data-race	2019-12-04 15:18:39 +01:00
test_kprobes.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25	2019-05-21 11:52:39 +02:00
torture.c	CPU (hotplug) updates:	2020-03-30 18:06:39 -07:00
tracepoint.c	The main changes in this release include:	2019-07-18 11:51:00 -07:00
tsacct.c	tsacct: add 64-bit btime field	2019-12-18 18:07:31 +01:00
ucount.c	proc/sysctl: add shared variables for range check	2019-07-18 17:08:07 -07:00
uid16.c	…
uid16.h	…
umh.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
up.c	smp/up: Make smp_call_function_single() match SMP semantics	2020-02-07 15:34:12 +01:00
user-return-notifier.c	treewide: Add SPDX license identifier for missed files	2019-05-21 10:50:45 +02:00
user.c	Keyrings namespacing	2019-07-08 19:36:47 -07:00
user_namespace.c	Keyrings namespacing	2019-07-08 19:36:47 -07:00
utsname.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441	2019-06-05 17:37:17 +02:00
utsname_sysctl.c	treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441	2019-06-05 17:37:17 +02:00
watchdog.c	watchdog/softlockup: Enforce that timestamp is valid on boot	2020-01-17 11:19:22 +01:00
watchdog_hld.c	kernel/watchdog_hld.c: hard lockup message should end with a newline	2019-04-19 09:46:05 -07:00
workqueue.c	workqueue: don't use wq_select_unbound_cpu() for bound works	2020-03-10 10:30:51 -04:00
workqueue_internal.h	sched/core, workqueues: Distangle worker accounting from rq lock	2019-04-16 16:55:15 +02:00