OpenCloudOS-Kernel

Steven Rostedt b6366f048e sched/rt: Use IPI to trigger RT task push migration instead of pulling

When debugging the latencies on a 40 core box, where we hit 300 to 500 microsecond latencies, I found there was a huge contention on the runqueue locks.

Investigating it further, running ftrace, I found that it was due to the pulling of RT tasks.

The test that was run was the following:

    cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU, that would set its wakeup in iterations of 100 microseconds. The -d0 means that all the threads had the same interval (100us). Each thread sleeps for 100us and wakes up and measures its latencies.

cyclictest is maintained at:
git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git

What happened was another RT task would be scheduled on one of the CPUs that was running our test, when the other CPU tests went to sleep and scheduled idle. This caused the "pull" operation to execute on all these CPUs. Each one of these saw the RT task that was overloaded on the CPU of the test that was still running, and each one tried to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing its own lock as well as the rq of the overloaded CPU. As the sched domains on this box were rather flat for its size, I saw up to 12 CPUs block on this lock at once. This caused a ripple effect with the rq locks, especially since the taking was done via a double rq lock, which means that several of the CPUs had their own rq locks held while trying to take this rq lock. As these locks were blocked, any wakeups or load balancing on these CPUs would also block on these locks, and the wait time escalated.

I've tried various methods to lessen the load, but things like an atomic counter to only let one CPU grab the task won't work, because the task may have a limited affinity, and we may pick the wrong CPU to take that lock and do the pull, only to find out that the CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull send over an IPI to the overloaded CPU, and let that CPU pick what CPU to push the task to. No more need to grab the rq lock, and the push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run. Without the patch, the huge latencies would trigger in seconds.

I've created a new sched feature called RT_PUSH_IPI, which is enabled by default.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks and having the pulling CPU do the work is implemented. When RT_PUSH_IPI is enabled, the IPI is sent to the overloaded CPU to do a push.

To enable or disable this at run time:

    # mount -t debugfs nodev /sys/kernel/debug
    # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features

or

    # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Update: This original patch would send an IPI to all CPUs in the RT overload list. But that could theoretically cause the reverse issue. That is, there could be lots of overloaded RT queues and one CPU lowers its priority. It would then send an IPI to all the overloaded RT queues and they could then all try to grab the rq lock of the CPU lowering its priority, and then we have the same problem.

The latest design sends out only one IPI to the first overloaded CPU. It tries to push any tasks that it can, and then looks for the next overloaded CPU that can push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable tasks that have priorities greater than the source CPU are covered. In case the source CPU lowers its priority again, a flag is set to tell the IPI traversal to restart with the first RT overloaded CPU after the source CPU.

Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joern Engel <joern@purestorage.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-03-23 10:55:22 +01:00
bpf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2015-01-27 13:55:36 -08:00
configs	x86: Add "make tinyconfig" to configure the tiniest possible kernel	2014-08-08 16:30:24 -07:00
debug	debug: prevent entering debug mode on panic/exception.	2015-02-19 12:39:03 -06:00
events	perf: Fix context leak in put_event()	2015-03-13 10:02:18 +01:00
gcov	kbuild,gcov: simplify kernel/gcov/Makefile more	2015-01-09 17:25:44 +01:00
irq	genirq / PM: Add flag for shared NO_SUSPEND interrupt lines	2015-03-04 21:42:19 +01:00
livepatch	livepatch: Fix subtle race with coming and going modules	2015-03-17 10:31:54 +01:00
locking	locking/rtmutex: Set state back to running on error	2015-03-01 09:45:06 +01:00
power	PM / sleep: Re-implement suspend-to-idle handling	2015-02-13 23:49:36 +01:00
printk	console: Fix console name size mismatch	2015-03-07 03:39:55 +01:00
rcu	Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-21 10:36:06 -08:00
sched	sched/rt: Use IPI to trigger RT task push migration instead of pulling	2015-03-23 10:55:22 +01:00
time	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-21 11:05:22 -08:00
trace	ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled	2015-03-09 10:55:34 -04:00
.gitignore	…
Kconfig.freezer	…
Kconfig.hz	…
Kconfig.locks	locking/mcs: Better differentiate between MCS variants	2015-01-14 15:07:32 +01:00
Kconfig.preempt	…
Makefile	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security	2015-02-11 20:25:11 -08:00
acct.c	new fs_pin killing logics	2015-01-25 23:17:28 -05:00
async.c	kernel/async.c: switch to pr_foo()	2014-10-09 22:26:04 -04:00
audit.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2014-12-30 10:45:47 -08:00
audit.h	audit: replace getname()/putname() hacks with reference counters	2015-01-23 00:23:58 -05:00
audit_tree.c	fsnotify: unify inode and mount marks handling	2014-12-13 12:42:53 -08:00
audit_watch.c	audit: invalid op= values for rules	2014-09-23 16:37:53 -04:00
auditfilter.c	Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit	2015-02-11 20:07:47 -08:00
auditsc.c	Merge branch 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2015-02-17 15:27:47 -08:00
backtracetest.c	kernel/backtracetest.c: replace no level printk by pr_info()	2014-06-04 16:54:14 -07:00
bounds.c	page-cgroup: get rid of NR_PCG_FLAGS	2014-08-08 15:57:18 -07:00
capability.c	CAPABILITIES: remove undefined caps from all processes	2014-07-24 21:53:47 +10:00
cgroup.c	kernfs: remove KERNFS_STATIC_NAME	2015-02-13 21:21:36 -08:00
cgroup_freezer.c	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes	2014-07-15 11:05:09 -04:00
compat.c	all arches, signal: move restart_block to struct task_struct	2015-02-12 18:54:12 -08:00
configs.c	…
context_tracking.c	sched: stop the unbound recursion in preempt_schedule_context()	2014-10-28 10:46:05 +01:00
cpu.c	hotplugcpu: Avoid deadlocks by waking active_writer	2015-01-06 11:01:14 -08:00
cpu_pm.c	…
cpuset.c	cpuset: Fix cpuset sched_relax_domain_level	2015-03-02 11:55:04 -05:00
crash_dump.c	crash_dump: Make is_kdump_kernel() accessible from modules	2014-08-25 15:42:19 -07:00
cred.c	…
delayacct.c	delayacct: Remove braindamaged type conversions	2014-07-23 10:18:06 -07:00
dma.c	…
elfcore.c	…
exec_domain.c	kernel/exec_domain.c: code clean-up	2014-06-04 16:54:15 -07:00
exit.c	oom, PM: make OOM detection in the freezer path raceless	2015-02-11 17:06:03 -08:00
extable.c	ftrace/x86/extable: Add is_ftrace_trampoline() function	2014-11-19 15:25:26 -05:00
fork.c	mm: do not use mm->nr_pmds on !MMU configurations	2015-02-12 18:54:10 -08:00
freezer.c	freezer: remove obsolete comments in __thaw_task()	2014-10-21 23:44:20 +02:00
futex.c	all arches, signal: move restart_block to struct task_struct	2015-02-12 18:54:12 -08:00
futex_compat.c	…
groups.c	userns: Don't allow setgroups until a gid mapping has been setablished	2014-12-09 16:58:40 -06:00
hung_task.c	kernel/hung_task.c: convert simple_strtoul to kstrtouint	2014-06-04 16:54:15 -07:00
irq_work.c	percpu: Convert remaining __get_cpu_var uses in 3.18-rcX	2014-10-29 11:18:18 -04:00
jump_label.c	…
kallsyms.c	kernel/kallsyms.c: use __seq_open_private()	2014-10-14 02:18:16 +02:00
kcmp.c	kcmp: fix standard comparison bug	2014-09-10 15:42:12 -07:00
kexec.c	kexec: simplify conditional	2015-02-17 14:34:51 -08:00
kmod.c	usermodehelper: kill the kmod_thread_locker logic	2014-12-10 17:41:17 -08:00
kprobes.c	kprobes: makes kprobes/enabled works correctly for optimized kprobes.	2015-02-13 21:21:42 -08:00
ksysfs.c	kobject: Make support for uevent_helper optional.	2014-04-25 12:00:49 -07:00
kthread.c	kernel/kthread.c: partial revert of `81c98869fa` ("kthread: ensure locality of task_struct allocations")	2014-10-09 22:25:51 -04:00
latencytop.c	kernel/latencytop.c: convert seq_printf to seq_puts	2014-06-04 16:54:15 -07:00
module-internal.h	…
module.c	kasan, module, vmalloc: rework shadow allocation for modules	2015-03-12 18:46:08 -07:00
module_signing.c	…
notifier.c	rcu: Make SRCU optional by using CONFIG_SRCU	2015-01-06 11:04:29 -08:00
nsproxy.c	bury struct proc_ns in fs/proc	2014-12-04 14:34:54 -05:00
padata.c	padata: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:38 -08:00
panic.c	livepatch: kernel: add TAINT_LIVEPATCH	2014-12-22 15:40:48 +01:00
params.c	param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC	2015-01-20 11:38:31 +10:30
pid.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
pid_namespace.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
profile.c	profile: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:38 -08:00
ptrace.c	ptrace: remove linux/compat.h inclusion under CONFIG_COMPAT	2015-02-17 14:34:51 -08:00
range.c	kernel: avoid overflow in cmp_range	2015-01-17 10:02:23 +13:00
reboot.c	kernel: add support for kernel restart handler call chain	2014-09-26 00:00:06 -07:00
relay.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-04-12 14:49:50 -07:00
resource.c	resources: Move struct resource_list_entry from ACPI into resource core	2015-02-05 15:09:25 +01:00
seccomp.c	seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO	2015-02-17 14:34:55 -08:00
signal.c	signal: use current->state helpers	2015-02-17 14:34:51 -08:00
smp.c	Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu	2014-10-15 07:48:18 +02:00
smpboot.c	smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread()	2015-01-23 11:33:51 +01:00
smpboot.h	…
softirq.c	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-09 15:24:03 -08:00
stacktrace.c	stacktrace: introduce snprint_stack_trace for buffer output	2014-12-13 12:42:48 -08:00
stop_machine.c	kernel/stop_machine.c: kernel-doc warning fix	2014-06-04 16:54:15 -07:00
sys.c	kernel/sys.c: fix UNAME26 for 4.0	2015-02-28 09:57:51 -08:00
sys_ni.c	syscalls: implement execveat() system call	2014-12-13 12:42:51 -08:00
sysctl.c	mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?	2015-02-10 14:30:34 -08:00
sysctl_binary.c	kernel: add panic_on_warn	2014-12-10 17:41:10 -08:00
system_certificates.S	…
system_keyring.c	KEYS: validate certificate trust only with builtin keys	2014-07-17 09:35:17 -04:00
task_work.c	…
taskstats.c	netlink: make nlmsg_end() and genlmsg_end() void	2015-01-18 01:03:45 -05:00
test_kprobes.c	kernel/test_kprobes.c: use current logging functions	2014-08-08 15:57:18 -07:00
torture.c	torture: Address race in module cleanup	2014-09-16 13:41:06 -07:00
tracepoint.c	tracing: syscall_regfunc() should not skip kernel threads	2014-06-21 00:15:26 -04:00
tsacct.c	sched: Make task->start_time nanoseconds based	2014-07-23 10:18:05 -07:00
uid16.c	groups: Consolidate the setgroups permission checks	2014-12-05 17:19:27 -06:00
up.c	smp: Rename __smp_call_function_single() to smp_call_function_single_async()	2014-02-24 14:47:15 -08:00
user-return-notifier.c	scheduler: Replace __get_cpu_var with this_cpu_ptr	2014-08-26 13:45:45 -04:00
user.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2014-12-17 12:31:40 -08:00
user_namespace.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2014-12-17 12:31:40 -08:00
utsname.c	copy address of proc_ns_ops into ns_common	2014-12-04 14:34:47 -05:00
utsname_sysctl.c	sysctl: convert use of typedef ctl_table to struct ctl_table	2014-06-06 16:08:16 -07:00
watchdog.c	kernel/sched/clock.c: add another clock for use with the soft lockup watchdog	2015-02-12 18:54:13 -08:00
workqueue.c	workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE	2015-03-05 08:04:13 -05:00
workqueue_internal.h	workqueue: rename manager_mutex to attach_mutex	2014-05-20 10:59:32 -04:00