The following commit:
9642d18eee ("nohz: Affine unpinned timers to housekeepers")'
intended to affine unpinned timers to housekeepers:
unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(houserkeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself)
However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
intention to:
unpinned timers(full dynaticks, idle) => any housekeepers(no mattter cpu topology)
unpinned timers(full dynaticks, busy) => any housekeepers(no mattter cpu topology)
unpinned timers(housekeepers, idle) => any busy cpus(otherwise, fallback to any housekeepers)
This patch fixes it by checking if there are busy housekeepers nearby,
otherwise falls to any housekeepers/itself. After the patch:
unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers)
unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Fixed the changelog. ]
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 'commit 9642d18eee ("nohz: Affine unpinned timers to housekeepers")'
Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pavan reported that in the presence of very light tasks (or cgroups)
the placement of migrated tasks can cause severe fairness issues.
The problem is that enqueue_entity() places the task before it updates
time, thereby it can place the task far in the past (remember that
light tasks will shoot virtual time forward at a high speed, so in
relation to the pre-existing light task, we can land far in the past).
This is done because update_curr() needs the current task, and we
might be placing the current task.
The obvious solution is to differentiate between the current and any
other task; placing the current before we update time, and placing any
other task after, such that !curr tasks end up at the current moment
in time, and not in the past.
This commit re-introduces the previously reverted commit:
3a47d5124a ("sched/fair: Fix fairness issue on migration")
... which is now safe to do, after we've also fixed another
underlying bug first, in:
sched/fair: Prepare to fix fairness problems on migration
and cleaned up other details in the migration code:
sched/core: Kill sched_class::task_waking
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With sched_class::task_waking being called only when we do
set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
and eliminate sched_class::task_waking entirely.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reported that our recent attempt to fix migration problems:
3a47d5124a ("sched/fair: Fix fairness issue on migration")
broke interactivity and the signal starve test. We reverted that
commit and now let's try it again more carefully, with some other
underlying problems fixed first.
One problem is that I assumed ENQUEUE_WAKING was only set when we do a
cross-cpu wakeup (migration), which isn't true. This means we now
destroy the vruntime history of tasks and wakeup-preemption suffers.
Cure this by making my assumption true, only call
sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
the indirect call in the case we do a local wakeup.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Fixes: 3a47d5124a ("sched/fair: Fix fairness issue on migration")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since I want to make ->task_woken() conditional on the task getting
migrated, we cannot use it to call record_wakee().
Move it to select_task_rq_fair(), which gets called in almost all the
same conditions. The only exception is if the woken task (@p) is
CPU-bound (as per the nr_cpus_allowed test in select_task_rq()).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reported that this recent commit:
3a47d5124a ("sched/fair: Fix fairness issue on migration")
... broke interactivity and the signal starvation test.
We have a proper fix series in the works but ran out of time for
v4.6, so revert the commit.
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We got this warning:
WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0
[...]
Call Trace:
dump_stack+0x63/0x87
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
set_task_cpu+0x1af/0x1c0
push_dl_task.part.34+0xea/0x180
push_dl_tasks+0x17/0x30
__balance_callback+0x45/0x5c
__sched_setscheduler+0x906/0xb90
SyS_sched_setattr+0x150/0x190
do_syscall_64+0x62/0x110
entry_SYSCALL64_slow_path+0x25/0x25
This corresponds to:
WARN_ON_ONCE(p->state == TASK_RUNNING &&
p->sched_class == &fair_sched_class &&
(p->on_rq && !task_on_rq_migrating(p)))
It happens because in find_lock_later_rq(), the task whose scheduling
class was changed to fair class is still pushed away as if it were
a deadline task ...
So, check in find_lock_later_rq() after double_lock_balance(), if the
scheduling class of the deadline task was changed, break and retry.
Apply the same logic to RT tasks.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
34e2c555f3 ("cpufreq: Add mechanism for registering utilization update callbacks")
overlooked the fact that update_load_avg(), where CFS invokes cpufreq
utilization update callbacks, becomes an empty stub on UP kernels.
In consequence, if !CONFIG_SMP, cpufreq governors are never invoked
from CFS and they do not have a chance to evaluate CPU performace
levels and update them often enough.
Needless to say, things don't work as expected then.
Fix the problem by making the !CONFIG_SMP stub of update_load_avg()
invoke cpufreq update callbacks too.
Reported-by: Steve Muckle <steve.muckle@linaro.org>
Tested-by: Steve Muckle <steve.muckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Steve Muckle <steve.muckle@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux PM list <linux-pm@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Fixes: 34e2c555f3 (cpufreq: Add mechanism for registering utilization update callbacks)
Link: http://lkml.kernel.org/r/6282396.VVEdgVYxO3@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No need for an extra notifier. We don't need to handle all these states. It's
sufficient to kill the timer when the cpu dies.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.770528462@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The alleged requirement that the migration notifier has a lower priority than
perf is completely undocumented and there is no indication at all that this is
true. perf does not even handle the CPU_ONLINE notification and perf really
has nothing to do with migration.
Move the CPU_ONLINE code into the sched_activate_cpu() state callback.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.421743581@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It really does not matter when we fold the load for the outgoing cpu. It's
almost dead anyway, so there is no harm if we fail to fold the few
microseconds which are required for going fully away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.328739226@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We can piggy pack that on the SCHED_STARTING state. It's not required before
the cpu actually comes online. Name the function proper as it has nothing to
do with migration.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.248226511@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The sync_rcu stuff is specificically for clearing bits in the active
mask, such that everybody will observe the bit cleared and will not
consider the cleared CPU for load-balancing etc.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.169219710@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that we reduced everything into single notifiers, it's simple to move them
into the hotplug state machine space.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is the last operation on the cpu before vanishing. No point in calling
that on CPU_DEAD.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We can maintain the ordering of the scheduler cpu hotplug functionality nicely
in one notifer. Get rid of the maze.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Prevent the SMP scheduler related notifiers to be executed before the smp
scheduler is initialized and install them early.
This is a preparatory change for further consolidation of the hotplug notifier
maze.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Start distangling the maze of hotplug notifiers in the scheduler.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to enable symmetric hotplug, we must mirror the online &&
!active state of cpu-down on the cpu-up side.
However, to retain sanity, limit this state to per-cpu kthreads.
Aside from the change to set_cpus_allowed_ptr(), which allow moving
the per-cpu kthreads on, the other critical piece is the cpu selection
for pinned tasks in select_task_rq(). This avoids dropping into
select_fallback_rq().
select_fallback_rq() cannot be allowed to select !active cpus because
its used to migrate user tasks away. And we do not want to move user
tasks onto cpus that are in transition.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160301152303.GV6356@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The comment in calculate_imbalance() was introduced in commit:
2dd73a4f09 ("[PATCH] sched: implement smpnice")
which described the logic as it was then, but a later commit:
b18855500f ("sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()")
.. complicated this logic some more so that the comment does not match anymore.
Update the comment to match the code.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-3-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 8e7fbcbc22 ("sched: Remove stale power aware scheduling remnants
and dysfunctional knobs") deleted the power aware scheduling support.
This patch gets rid of the remaining power aware scheduling related
comments in the code as well.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-2-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we're accessing rq_clock() (e.g. in sched_avg_update()) we should
update the rq clock before calling cpu_load_update(), otherwise any
time calculations will be stale.
All other paths currently call update_rq_clock().
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The max_idle_balance_cost and avg_idle values which are tracked and ar used to
capture short idle incidents, are not associated with schedstats, however the
information of these two values isn't printed out on !CONFIG_SCHEDSTATS kernels.
Fix this by moving the value printout out of the CONFIG_SCHEDSTATS section.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462250305-4523-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__compute_runnable_contrib() uses a loop to compute sum, whereas a
table lookup can do it faster in a constant amount of time.
The program to generate the constants is located at:
Documentation/scheduler/sched-avg.txt
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@arm.com
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1462226078-31904-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After cleaning up the sched metrics, there are two definitions that are
ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SHIFT.
Resolve this:
- Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what
it is.
- Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE.
Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com
[ Rewrote the changelog and fixed the build on 32-bit kernels. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Integer metric needs fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.
In order to avoid the errors relating to the fixed point range, we
definie a basic fixed point range, and then formalize all metrics to
base on the basic range.
The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to have larger range.
Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike ran into the low load resolution limitation on his big machine.
So reenable these bits; nobody could ever reproduce/analyze the
reported power usage claim and Google has been running with this for
years as well.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The problem with the existing lock pinning is that each pin is of
value 1; this mean you can simply unpin if you know its pinned,
without having any extra information.
This scheme generates a random (16 bit) cookie for each pin and
requires this same cookie to unpin. This means you have to keep the
cookie in context.
No objsize difference for !LOCKDEP kernels.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to be able to pass around more than just the IRQ flags in the
future, add a rq_flags structure.
No difference in code generation for the x86_64-defconfig build I
tested.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
By default, this is the same thing as switch_mm().
x86 will override it as an optimization.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sometimes delta_exec is 0 due to update_curr() is called multiple times,
this is captured by:
u64 delta_exec = rq_clock_task(rq) - curr->se.exec_start;
This patch optimizes the cpufreq update kicker by bailing out when nothing
changed, it will benefit the upcoming schedutil, since otherwise it will
(over)react to the special util/max combination.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461316044-9520-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Chris Metcalf reported a that sched_can_stop_tick() sometimes fails to
re-enable the tick.
His observed problem is that rq->cfs.nr_running can be 1 even though
there are multiple runnable CFS tasks. This happens in the cgroup
case, in which case cfs.nr_running is the number of runnable entities
for that level.
If there is a single runnable cgroup (which can have an arbitrary
number of runnable child entries itself) rq->cfs.nr_running will be 1.
However, looking at that function I think there's more problems with it.
It seems to assume that if there's FIFO tasks, those will run. This is
incorrect. The FIFO task can have a lower prio than an RR task, in which
case the RR task will run.
So the whole fifo_nr_running test seems misplaced, it should go after
the rr_nr_running tests. That is, only if !rr_nr_running, can we use
fifo_nr_running like this.
Reported-by: Chris Metcalf <cmetcalf@mellanox.com>
Tested-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Fixes: 76d92ac305 ("sched: Migrate sched to use new tick dependency mask model")
Link: http://lkml.kernel.org/r/20160421160315.GK24771@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I got a minus(very big) dl_b->total_bw during my deadline tests.
# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : -222297900
Something unusual must have happened.
After some digging, I finally noticed that when changing a deadline
task to normal(cfs), and changing it back to deadline immediately,
after it died, we will got the wrong dl_bw->total_bw.
The root cause is in dl_overflow(), it has:
if (new_bw == p->dl.dl_bw)
return 0;
1) When a deadline task is changed to !deadline task, it will start
dl timer in switched_from_dl(), and retain previous deadline parameter
till the timer expires.
2) If we change it back to deadline with the same bandwidth parameter
before the timer expires, as it keeps the old bandwidth although it
is not a deadline task. dl_overflow() simply returns success without
updating the right data, and got the wrong dl_bw->total_bw.
The solution is simple, if @p is not deadline, don't return.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460636368-1993-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some code in CPU load update only concern NO_HZ configs but it is
built on all configurations. When NO_HZ isn't built, that code is harmless
but just happens to take some useless ressources in CPU and memory:
1) one useless field in struct rq
2) jiffies record on every tick that is never used (cpu_load_update_periodic)
3) decay_load_missed is called two times on every tick to eventually
return immediately with no action taken. And that function is dead
code.
For pure optimization purposes, lets conditionally build the NO_HZ
related code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461080211-16271-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
mode and we try to minimize the ticks as much as possible. The nohz
subsystem uses a confusing terminology with the internal state
"ts->tick_stopped" which is also available through its public interface
with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead
reduced with the best effort rather than stopped. In the best case the
tick can indeed be actually stopped but there is no guarantee about that.
If a timer needs to fire one second later, a tick will fire while the
CPU is in nohz mode and this is a very common scenario.
Now this confusion happens to be a problem with CPU load updates:
cpu_load_update_active() doesn't handle nohz ticks correctly because it
assumes that ticks are completely stopped in nohz mode and that
cpu_load_update_active() can't be called in dynticks mode. When that
happens, the whole previous tickless load is ignored and the function
just records the load for the current tick, ignoring potentially long
idle periods behind.
In order to solve this, we could account the current load for the
previous nohz time but there is a risk that we account the load of a
task that got freshly enqueued for the whole nohz period.
So instead, lets record the dynticks load on nohz frame entry so we know
what to record in case of nohz ticks, then use this record to account
the tickless load on nohz ticks and nohz frame end.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The CPU load update related functions have a weak naming convention
currently, starting with update_cpu_load_*() which isn't ideal as
"update" is a very generic concept.
Since two of these functions are public already (and a third is to come)
that's enough to introduce a more conventional naming scheme. So let's
do the following rename instead:
update_cpu_load_*() -> cpu_load_update_*()
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cpufreq hook should be called any time the root CFS rq utilization
changes. This can occur when a task is switched to or from the fair
class, or a task moves between groups or CPUs, but these paths
currently do not call the cpufreq hook.
Fix this by adding the hook to attach_entity_load_avg() and
detach_entity_load_avg().
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[ Added the .update_freq argument to update_cfs_rq_load_avg() to avoid a double cpufreq call. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458858367-2831-1-git-send-email-smuckle@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's no reason to call the cpufreq hook if the root cfs_rq
utilization has not been modified.
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1458606068-7476-2-git-send-email-smuckle@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cpufreq hook should be called whenever the root cfs_rq
utilization changes so update_cfs_rq_load_avg() is a better
place for it. The current location is not invoked in the
enqueue_entity() or update_blocked_averages() paths.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Michael Turquette <mturquette@baylibre.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458606068-7476-1-git-send-email-smuckle@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When asymmetric packing is set in the sched_domain and target CPU is
busy, update_sd_pick_busiest() may not select the busiest runqueue.
When target CPU is busy, find_busiest_group() will ignore checks for
asym packing and may continue to load balance using the currently
selected not-the-busiest runqueue as source runqueue.
Selecting the busiest runqueue as source when the target CPU is busy,
should result in achieving much better load balance.
Also when target CPU is not busy and asymmetric packing is set in sd,
select higher CPU as source CPU for load balancing.
While doing this change, move the check to see if target CPU is busy
into check_asym_packing().
The extent of performance benefit from this change decreases with the
increasing load. However there is benefit in undercommit as well as
overcommit conditions.
1. Record per second ebizzy (32 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
Testcase: Min Max Avg StdDev
ebizzy: 5223767.00 10368236.00 7946971.00 1753094.76
4.6.0-rc2+asym-changes
Testcase: Min Max Avg StdDev %Change
ebizzy: 8617191.00 13872356.00 11383980.00 1783400.89 +24.78%
2. Record per second ebizzy (64 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
Testcase: Min Max Avg StdDev
ebizzy: 6497666.00 18399783.00 10818093.20 4051452.08
4.6.0-rc2+asym-changes
Testcase: Min Max Avg StdDev %Change
ebizzy: 7567365.00 19456937.00 11674063.60 4295407.48 +4.40%
3. Record per second ebizzy (128 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
Testcase: Min Max Avg StdDev
ebizzy: 37073983.00 40341911.00 38776241.80 1259766.82
4.6.0-rc2+asym-changes
Testcase: Min Max Avg StdDev %Change
ebizzy: 38030399.00 41333378.00 39827404.40 1255001.86 +2.54%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1459948660-16073-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_pt_regs() can return NULL for kernel threads, so add a check.
This fixes an oops at boot on ppc64.
Reported-and-Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: htejun@gmail.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: tj@kernel.org
Cc: yangds.fnst@cn.fujitsu.com
Link: http://lkml.kernel.org/r/20160406215950.04bc3f0b@kryten
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The local_clock/cpu_clock functions were changed to prevent a double
identical test with sched_clock_cpu() when HAVE_UNSTABLE_SCHED_CLOCK
is set. That resulted in one line functions.
As these functions are in all the cases one line functions and in the
hot path, it is useful to specify them as static inline in order to
give a strong hint to the compiler.
After verification, it appears the compiler does not inline them
without this hint. Change those functions to static inline.
sched_clock_cpu() is called via the inlined local_clock()/cpu_clock()
functions from sched.h. So any module code including sched.h will
reference sched_clock_cpu(). Thus it must be exported with the
EXPORT_SYMBOL_GPL macro.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460385514-14700-2-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>