Commit Graph

166 Commits

Author SHA1 Message Date
Linus Torvalds af79ad2b1f Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes are:

   - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

   - schedstat debugging enhancements, make it more broadly runtime
     available. (Josh Poimboeuf)

   - More work on asymmetric topology/capacity scheduling. (Morten
     Rasmussen)

   - sched/wait fixes and cleanups. (Oleg Nesterov)

   - PELT (per entity load tracking) improvements. (Peter Zijlstra)

   - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

   - sched/numa enhancements/fixes (Rik van Riel)

   - sched/cputime scalability improvements (Stanislaw Gruszka)

   - Load calculation arithmetics fixes. (Dietmar Eggemann)

   - sched/deadline enhancements (Tommaso Cucinotta)

   - Fix utilization accounting when switching to the SCHED_NORMAL
     policy. (Vincent Guittot)

   - ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/irqtime: Consolidate irqtime flushing code
  sched/irqtime: Consolidate accounting synchronization with u64_stats API
  u64_stats: Introduce IRQs disabled helpers
  sched/irqtime: Remove needless IRQs disablement on kcpustat update
  sched/irqtime: No need for preempt-safe accessors
  sched/fair: Fix min_vruntime tracking
  sched/debug: Add SCHED_WARN_ON()
  sched/core: Fix set_user_nice()
  sched/fair: Introduce set_curr_task() helper
  sched/core, ia64: Rename set_curr_task()
  sched/core: Fix incorrect utilization accounting when switching to fair class
  sched/core: Optimize SCHED_SMT
  sched/core: Rewrite and improve select_idle_siblings()
  sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched/core: Introduce 'struct sched_domain_shared'
  sched/core: Restructure destroy_sched_domain()
  sched/core: Remove unused @cpu argument from destroy_sched_domain*()
  sched/wait: Introduce init_wait_entry()
  sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
  sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
  ...
2016-10-03 13:39:00 -07:00
Wanpeng Li 61c7aca695 sched/deadline: Fix the intention to re-evalute tick dependency for offline CPU
The dl task will be replenished after dl task timer fire and start a
new period. It will be enqueued and to re-evaluate its dependency on
the tick in order to restart it. However, if the CPU is hot-unplugged,
irq_work_queue will splash since the target CPU is offline.

As a result we get:

    WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
    Call Trace:
     dump_stack+0x99/0xd0
     __warn+0xd1/0xf0
     warn_slowpath_null+0x1d/0x20
     irq_work_queue_on+0xad/0xe0
     tick_nohz_full_kick_cpu+0x44/0x50
     tick_nohz_dep_set_cpu+0x74/0xb0
     enqueue_task_dl+0x226/0x480
     activate_task+0x5c/0xa0
     dl_task_timer+0x19b/0x2c0
     ? push_dl_task.part.31+0x190/0x190

This can be triggered by hot-unplugging the full dynticks CPU which dl
task is running on.

We enqueue the dl task on the offline CPU, because we need to do
replenish for start_dl_timer(). So, as Juri pointed out, we would
need to do is calling replenish_dl_entity() directly, instead of
enqueue_task_dl(). pi_se shouldn't be a problem as the task shouldn't
be boosted if it was throttled.

This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story, by
first migrating (set_task_cpu()) and then doing 1 enqueue.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:45 +02:00
Tommaso Cucinotta d8206bb3ff sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear()
These 2 exercise independent code paths and need different arguments.

After this change, you call:

  cpudl_clear(cp, cpu);
  cpudl_set(cp, cpu, dl);

instead of:

  cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */);
  cpudl_set(cp, cpu, dl, 1 /* is_valid */);

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-4-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:43 +02:00
Rafael J. Wysocki 12bde33dbb cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()
All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
as the time argument and some of them check whether or not cpu_of(rq)
is equal to smp_processor_id() before calling it, so rework it to
take a runqueue pointer as the argument and move the rq_clock(rq)
evaluation into it.

Additionally, provide a wrapper checking cpu_of(rq) against
smp_processor_id() for the cpufreq_update_util() callers that
need it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:16:03 +02:00
Rafael J. Wysocki 58919e83c8 cpufreq / sched: Pass flags to cpufreq_update_util()
It is useful to know the reason why cpufreq_update_util() has just
been called and that can be passed as flags to cpufreq_update_util()
and to the ->func() callback in struct update_util_data.  However,
doing that in addition to passing the util and max arguments they
already take would be clumsy, so avoid it.

Instead, use the observation that the schedutil governor is part
of the scheduler proper, so it can access scheduler data directly.
This allows the util and max arguments of cpufreq_update_util()
and the ->func() callback in struct update_util_data to be replaced
with a flags one, but schedutil has to be modified to follow.

Thus make the schedutil governor obtain the CFS utilization
information from the scheduler and use the "RT" and "DL" flags
instead of the special utilization value of ULONG_MAX to track
updates from the RT and DL sched classes.  Make it non-modular
too to avoid having to export scheduler variables to modules at
large.

Next, update all of the other users of cpufreq_update_util()
and the ->func() callback in struct update_util_data accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:14:55 +02:00
Juri Lelli 98b0a85780 sched/deadline: Remove useless parameter from setup_new_dl_entity()
setup_new_dl_entity() takes two parameters, but it only actually uses
one of them, under a different name, to setup a new dl_entity, after:

  2f9f3fdc928 "sched/deadline: Remove dl_new from struct sched_dl_entity"

as we currently do:

  setup_new_dl_entity(&p->dl, &p->dl)

However, before Luca's change we were doing:

  setup_new_dl_entity(dl_se, pi_se)

in update_dl_entity() for a dl_se->new entity: we were using pi_se's
parameters (the potential PI donor) for setting up a new entity.

This change removes the useless second parameter of setup_new_dl_entity().

While we are at it we also optimize things further calling setup_new_dl_
entity() only for already queued tasks, since (as pointed out by Xunlei)
we already do the very same update at tasks wakeup time anyway. By doing
so, we don't need to worry about a potential PI donor anymore, as
rt_mutex_setprio() takes care of that already for us.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xunlei Pang <xpang@redhat.com>
Link: http://lkml.kernel.org/r/1470409675-20935-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:03:32 +02:00
Wanpeng Li c0c8c9fa21 sched/deadline: Fix lock pinning warning during CPU hotplug
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:

  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
  releasing a pinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   ? dl_task_timer+0x1a1/0x2b0
   warn_slowpath_fmt+0x4f/0x60
   ? sched_clock+0x13/0x20
   lock_release+0x690/0x6a0
   ? enqueue_pushable_dl_task+0x9b/0xa0
   ? enqueue_task_dl+0x1ca/0x480
   _raw_spin_unlock+0x1f/0x40
   dl_task_timer+0x1a1/0x2b0
   ? push_dl_task.part.31+0x190/0x190
  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
  unpinning an unpinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   warn_slowpath_fmt+0x4f/0x60
   lock_unpin_lock+0x181/0x1a0
   dl_task_timer+0x127/0x2b0
   ? push_dl_task.part.31+0x190/0x190

As per the comment before this code, its safe to drop the RQ lock
here, and since we (potentially) change rq, unpin and repin to avoid
the splat.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Rewrote changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:02:55 +02:00
Thomas Gleixner 50605ffbda sched/core: Provide a tsk_nr_cpus_allowed() helper
tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows
us to change the representation of ->nr_cpus_allowed if required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1462969411-17735-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:36 +02:00
Thomas Gleixner ade42e092b sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
Use the future-safe accessor for struct task_struct's.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1462969411-17735-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:35 +02:00
Ingo Molnar eb60b3e5e8 Merge branch 'sched/urgent' into sched/core to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:18:13 +02:00
Xunlei Pang 13b5ab02ae sched/rt, sched/dl: Don't push if task's scheduling class was changed
We got this warning:

    WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0
    [...]
    Call Trace:

    dump_stack+0x63/0x87
    __warn+0xd1/0xf0
    warn_slowpath_null+0x1d/0x20
    set_task_cpu+0x1af/0x1c0
    push_dl_task.part.34+0xea/0x180
    push_dl_tasks+0x17/0x30
    __balance_callback+0x45/0x5c
    __sched_setscheduler+0x906/0xb90
    SyS_sched_setattr+0x150/0x190
    do_syscall_64+0x62/0x110
    entry_SYSCALL64_slow_path+0x25/0x25

This corresponds to:

    WARN_ON_ONCE(p->state == TASK_RUNNING &&
             p->sched_class == &fair_sched_class &&
             (p->on_rq && !task_on_rq_migrating(p)))

It happens because in find_lock_later_rq(), the task whose scheduling
class was changed to fair class is still pushed away as if it were
a deadline task ...

So, check in find_lock_later_rq() after double_lock_balance(), if the
scheduling class of the deadline task was changed, break and retry.

Apply the same logic to RT tasks.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:02:46 +02:00
Peter Zijlstra e7904a28f5 locking/lockdep, sched/core: Implement a better lock pinning scheme
The problem with the existing lock pinning is that each pin is of
value 1; this mean you can simply unpin if you know its pinned,
without having any extra information.

This scheme generates a random (16 bit) cookie for each pin and
requires this same cookie to unpin. This means you have to keep the
cookie in context.

No objsize difference for !LOCKDEP kernels.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:23:59 +02:00
Peter Zijlstra eb58075149 sched/core: Introduce 'struct rq_flags'
In order to be able to pass around more than just the IRQ flags in the
future, add a rq_flags structure.

No difference in code generation for the x86_64-defconfig build I
tested.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:23:59 +02:00
Wanpeng Li 594dd290cf sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
Sometimes delta_exec is 0 due to update_curr() is called multiple times,
this is captured by:

	u64 delta_exec = rq_clock_task(rq) - curr->se.exec_start;

This patch optimizes the cpufreq update kicker by bailing out when nothing
changed, it will benefit the upcoming schedutil, since otherwise it will
(over)react to the special util/max combination.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461316044-9520-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28 10:39:54 +02:00
Linus Torvalds 277edbabf6 Power management and ACPI material for v4.6-rc1, part 1
- Redesign of cpufreq governors and the intel_pstate driver to
    make them use callbacks invoked by the scheduler to trigger CPU
    frequency evaluation instead of using per-CPU deferrable timers
    for that purpose (Rafael Wysocki).
 
  - Reorganization and cleanup of cpufreq governor code to make it
    more straightforward and fix some concurrency problems in it
    (Rafael Wysocki, Viresh Kumar).
 
  - Cleanup and improvements of locking in the cpufreq core (Viresh
    Kumar).
 
  - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
    Kumar, Eric Biggers).
 
  - intel_pstate driver updates including fixes, optimizations and a
    modification to make it enable enable hardware-coordinated P-state
    selection (HWP) by default if supported by the processor (Philippe
    Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
    Franciosi).
 
  - Operating Performance Points (OPP) framework updates to improve
    its handling of voltage regulators and device clocks and updates
    of the cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).
 
  - Updates of the powernv cpufreq driver to fix initialization
    and cleanup problems in it and correct its worker thread handling
    with respect to CPU offline, new powernv_throttle tracepoint
    (Shilpasri Bhat).
 
  - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).
 
  - ACPICA updates including one fix for a regression introduced
    by previos changes in the ACPICA code (Bob Moore, Lv Zheng,
    David Box, Colin Ian King).
 
  - Support for installing ACPI tables from initrd (Lv Zheng).
 
  - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
    Chaugule).
 
  - Support for _HID(ACPI0010) devices (ACPI processor containers)
    and ACPI processor driver cleanups (Sudeep Holla).
 
  - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
    Aleksey Makarov).
 
  - Modification of the ACPI PCI IRQ management code to make it treat
    255 in the Interrupt Line register as "not connected" on x86 (as
    per the specification) and avoid attempts to use that value as
    a valid interrupt vector (Chen Fan).
 
  - ACPI APEI fixes related to resource leaks (Josh Hunt).
 
  - Removal of modularity from a few ACPI drivers (BGRT, GHES,
    intel_pmic_crc) that cannot be built as modules in practice (Paul
    Gortmaker).
 
  - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
    as a valid resource type (Harb Abdulhamid).
 
  - New device ID (future AMD I2C controller) in the ACPI driver for
    AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).
 
  - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).
 
  - cpuidle menu governor optimization to avoid a square root
    computation in it (Rasmus Villemoes).
 
  - Fix for potential use-after-free in the generic device properties
    framework (Heikki Krogerus).
 
  - Updates of the generic power domains (genpd) framework including
    support for multiple power states of a domain, fixes and debugfs
    output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
    Geert Uytterhoeven).
 
  - Intel RAPL power capping driver updates to reduce IPI overhead in
    it (Jacob Pan).
 
  - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
    Sengar).
 
  - Year 2038 fix for the process freezer (Abhilash Jindal).
 
  - turbostat utility updates including new features (decoding of more
    registers and CPUID fields, sub-second intervals support, GFX MHz
    and RC6 printout, --out command line option), fixes (syscall jitter
    detection and workaround, reductioin of the number of syscalls made,
    fixes related to Xeon x200 processors, compiler warning fixes) and
    cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJW50NXAAoJEILEb/54YlRxvr8QAIktC9+ft0y5AmU46hDcBWcK
 QutyWJL9X9BS6DWBJZA2qclDYFmhMfi5Fza1se0gQ9TnLB/KrBwHWLsiYoTsb1k+
 nPKf214aPk+qAhkVuyB4leNWML9Qz9n9jwku/EYxWWpgtbSRf3+0ioIKZeWWc/8V
 JvuaOu4O+g/tkmL7QTrnGWBwhIIssAAV85QPsHkx+g68MrCj4UMMzm7z9G21SPXX
 bmP8yIHsczX/XnRsY0W2NSno7Vdk6ImHpDJ26IAZg28WRNPWICHgGYHvB0TTWMvb
 tts+yqfF7/7QLRjT/M8k9CzDBDE/DnVqoZ0fNJ+aYr7hNKF32mtAN+jH9ZB9dl/P
 fEFapJkPxnWyzAoVoB9Dz0rkcZkYMlbxlLWzUGpaPq0JflUUTzLk0ApSjmMn4HRO
 UddwCDdyHTaYThp3gn6GbOb0pIP0SdOVbI1M2QV2x/4PLcT2Ft8Np1+1RFWOeinZ
 Bdl9AE890big0808mqbBzw/buETwr9FjHtCdDPXpP0vJpkBLu3nIYRNb0LCt39es
 mWMp6dFhGgvGj3D3ahTuV3GI8hdpDkh9SObexa11RCjkTKrXcwEmFxHxLeFXwKYq
 alG278bo6cSChRMziS1lis+W/3tsJRN4TXUSv1PPzJHrFgptQVFRStU9ngBKP+pN
 WB+itPc4Fw0YHOrAFsrx
 =cfty
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael Wysocki:
 "This time the majority of changes go into cpufreq and they are
  significant.

  First off, the way CPU frequency updates are triggered is different
  now.  Instead of having to set up and manage a deferrable timer for
  each CPU in the system to evaluate and possibly change its frequency
  periodically, cpufreq governors set up callbacks to be invoked by the
  scheduler on a regular basis (basically on utilization updates).  The
  "old" governors, "ondemand" and "conservative", still do all of their
  work in process context (although that is triggered by the scheduler
  now), but intel_pstate does it all in the callback invoked by the
  scheduler with no need for any additional asynchronous processing.

  Of course, this eliminates the overhead related to the management of
  all those timers, but also it allows the cpufreq governor code to be
  simplified quite a bit.  On top of that, the common code and data
  structures used by the "ondemand" and "conservative" governors are
  cleaned up and made more straightforward and some long-standing and
  quite annoying problems are addressed.  In particular, the handling of
  governor sysfs attributes is modified and the related locking becomes
  more fine grained which allows some concurrency problems to be avoided
  (particularly deadlocks with the core cpufreq code).

  In principle, the new mechanism for triggering frequency updates
  allows utilization information to be passed from the scheduler to
  cpufreq.  Although the current code doesn't make use of it, in the
  works is a new cpufreq governor that will make decisions based on the
  scheduler's utilization data.  That should allow the scheduler and
  cpufreq to work more closely together in the long run.

  In addition to the core and governor changes, cpufreq drivers are
  updated too.  Fixes and optimizations go into intel_pstate, the
  cpufreq-dt driver is updated on top of some modification in the
  Operating Performance Points (OPP) framework and there are fixes and
  other updates in the powernv cpufreq driver.

  Apart from the cpufreq updates there is some new ACPICA material,
  including a fix for a problem introduced by previous ACPICA updates,
  and some less significant changes in the ACPI code, like CPPC code
  optimizations, ACPI processor driver cleanups and support for loading
  ACPI tables from initrd.

  Also updated are the generic power domains framework, the Intel RAPL
  power capping driver and the turbostat utility and we have a bunch of
  traditional assorted fixes and cleanups.

  Specifics:

   - Redesign of cpufreq governors and the intel_pstate driver to make
     them use callbacks invoked by the scheduler to trigger CPU
     frequency evaluation instead of using per-CPU deferrable timers for
     that purpose (Rafael Wysocki).

   - Reorganization and cleanup of cpufreq governor code to make it more
     straightforward and fix some concurrency problems in it (Rafael
     Wysocki, Viresh Kumar).

   - Cleanup and improvements of locking in the cpufreq core (Viresh
     Kumar).

   - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
     Kumar, Eric Biggers).

   - intel_pstate driver updates including fixes, optimizations and a
     modification to make it enable enable hardware-coordinated P-state
     selection (HWP) by default if supported by the processor (Philippe
     Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
     Franciosi).

   - Operating Performance Points (OPP) framework updates to improve its
     handling of voltage regulators and device clocks and updates of the
     cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).

   - Updates of the powernv cpufreq driver to fix initialization and
     cleanup problems in it and correct its worker thread handling with
     respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
     Bhat).

   - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).

   - ACPICA updates including one fix for a regression introduced by
     previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
     Colin Ian King).

   - Support for installing ACPI tables from initrd (Lv Zheng).

   - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
     Chaugule).

   - Support for _HID(ACPI0010) devices (ACPI processor containers) and
     ACPI processor driver cleanups (Sudeep Holla).

   - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
     Aleksey Makarov).

   - Modification of the ACPI PCI IRQ management code to make it treat
     255 in the Interrupt Line register as "not connected" on x86 (as
     per the specification) and avoid attempts to use that value as a
     valid interrupt vector (Chen Fan).

   - ACPI APEI fixes related to resource leaks (Josh Hunt).

   - Removal of modularity from a few ACPI drivers (BGRT, GHES,
     intel_pmic_crc) that cannot be built as modules in practice (Paul
     Gortmaker).

   - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
     as a valid resource type (Harb Abdulhamid).

   - New device ID (future AMD I2C controller) in the ACPI driver for
     AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).

   - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).

   - cpuidle menu governor optimization to avoid a square root
     computation in it (Rasmus Villemoes).

   - Fix for potential use-after-free in the generic device properties
     framework (Heikki Krogerus).

   - Updates of the generic power domains (genpd) framework including
     support for multiple power states of a domain, fixes and debugfs
     output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
     Geert Uytterhoeven).

   - Intel RAPL power capping driver updates to reduce IPI overhead in
     it (Jacob Pan).

   - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
     Sengar).

   - Year 2038 fix for the process freezer (Abhilash Jindal).

   - turbostat utility updates including new features (decoding of more
     registers and CPUID fields, sub-second intervals support, GFX MHz
     and RC6 printout, --out command line option), fixes (syscall jitter
     detection and workaround, reductioin of the number of syscalls
     made, fixes related to Xeon x200 processors, compiler warning
     fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"

* tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
  tools/power turbostat: bugfix: TDP MSRs print bits fixing
  tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
  tools/power turbostat: call __cpuid() instead of __get_cpuid()
  tools/power turbostat: indicate SMX and SGX support
  tools/power turbostat: detect and work around syscall jitter
  tools/power turbostat: show GFX%rc6
  tools/power turbostat: show GFXMHz
  tools/power turbostat: show IRQs per CPU
  tools/power turbostat: make fewer systems calls
  tools/power turbostat: fix compiler warnings
  tools/power turbostat: add --out option for saving output in a file
  tools/power turbostat: re-name "%Busy" field to "Busy%"
  tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
  tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
  tools/power turbostat: allow sub-sec intervals
  ACPI / APEI: ERST: Fixed leaked resources in erst_init
  ACPI / APEI: Fix leaked resources
  intel_pstate: Do not skip samples partially
  intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
  intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
  ...
2016-03-16 14:10:53 -07:00
Rafael J. Wysocki 4ed3900427 Merge branch 'pm-cpufreq'
* pm-cpufreq: (94 commits)
  intel_pstate: Do not skip samples partially
  intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
  intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
  intel_pstate: Optimize calculation for max/min_perf_adj
  intel_pstate: Remove extra conversions in pid calculation
  cpufreq: Move scheduler-related code to the sched directory
  Revert "cpufreq: postfix policy directory with the first CPU in related_cpus"
  cpufreq: Reduce cpufreq_update_util() overhead a bit
  cpufreq: Select IRQ_WORK if CPU_FREQ_GOV_COMMON is set
  cpufreq: Remove 'policy->governor_enabled'
  cpufreq: Rename __cpufreq_governor() to cpufreq_governor()
  cpufreq: Relocate handle_update() to kill its declaration
  cpufreq: governor: Drop unnecessary checks from show() and store()
  cpufreq: governor: Fix race in dbs_update_util_handler()
  cpufreq: governor: Make gov_set_update_util() static
  cpufreq: governor: Narrow down the dbs_data_mutex coverage
  cpufreq: governor: Make dbs_data_mutex static
  cpufreq: governor: Relocate definitions of tuners structures
  cpufreq: governor: Move per-CPU data to the common code
  cpufreq: governor: Make governor private data per-policy
  ...
2016-03-14 14:22:03 +01:00
Rafael J. Wysocki 34e2c555f3 cpufreq: Add mechanism for registering utilization update callbacks
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-03-09 14:39:19 +01:00
Luca Abeni 72f9f3fdc9 sched/deadline: Remove dl_new from struct sched_dl_entity
The dl_new field of struct sched_dl_entity is currently used to
identify new deadline tasks, so that their deadline and runtime
can be properly initialised.

However, these tasks can be easily identified by checking if
their deadline is smaller than the current time when they switch
to SCHED_DEADLINE. So, dl_new can be removed by introducing this
check in switched_to_dl(); this allows to simplify the
SCHED_DEADLINE code.

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1457350024-7825-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:24:55 +01:00
Peter Zijlstra 801ccdbf01 sched/deadline: Remove superfluous call to switched_to_dl()
if (A || B) {

	} else if (A && !B) {

	}

If A we'll take the first branch, if !A we will not satisfy the second.
Therefore the second branch will never be taken.

Reported-by: luca abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160225140149.GK6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:11 +01:00
Peter Zijlstra 48be3a67da sched/deadline: Always calculate end of period on sched_yield()
Steven noticed that occasionally a sched_yield() call would not result
in a wait for the next period edge as expected.

It turns out that when we call update_curr_dl() and end up with
delta_exec <= 0, we will bail early and fail to throttle.

Further inspection of the yield code revealed that yield_task_dl()
clearing dl.runtime is wrong too, it will not account the last bit of
runtime which could result in dl.runtime < 0, which in turn means that
replenish would gift us with too much runtime.

Fix both issues by not relying on the dl.runtime value for yield.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160223122822.GP6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:41:51 +01:00
Steven Rostedt c219b7ddb6 sched/deadline: Fix trivial typo in printk() message
It's "too much" not "to much".

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160210120422.4ca77e68@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-17 10:02:00 +01:00
Wanpeng Li 7d92de3a82 sched/deadline: Fix the earliest_dl.next logic
earliest_dl.next should cache deadline of the earliest ready task that
is also enqueued in the pushable rbtree, as pull algorithm uses this
information to find candidates for migration: if the earliest_dl.next
deadline of source rq is earlier than the earliest_dl.curr deadline of
destination rq, the task from the source rq can be pulled.

However, current implementation only guarantees that earliest_dl.next is
the deadline of the next ready task instead of the next pushable task;
which will result in potentially holding both rqs' lock and find nothing
to migrate because of affinity constraints. In addition, current logic
doesn't update the next candidate for pushing in pick_next_task_dl(),
even if the running task is never eligible.

This patch fixes both problems by updating earliest_dl.next when
pushable dl task is enqueued/dequeued, similar to what we already do for
RT.

Tested-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449135730-27202-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06 11:05:56 +01:00
Peter Zijlstra 0aaafaabfc sched/core: Add missing lockdep_unpin() annotations
Luca and Wanpeng reported two missing annotations that led to
false lockdep complaints. Add the missing annotations.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: cbce1a6867 ("sched,lockdep: Employ lock pinning")
Link: http://lkml.kernel.org/r/20151023095008.GY17308@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-23 12:02:10 +02:00
Luca Abeni 5aa5050787 sched/deadline: Fix migration of SCHED_DEADLINE tasks
Commit:

  9d51426242 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")

broke select_task_rq_dl() and find_lock_later_rq(), because it introduced
a comparison between the local task's deadline and dl.earliest_dl.curr of
the remote queue.

However, if the remote runqueue does not contain any SCHED_DEADLINE
task its earliest_dl.curr is 0 (always smaller than the deadline of
the local task) and the remote runqueue is not selected for pushing.

As a result, if an application creates multiple SCHED_DEADLINE
threads, they will never be pushed to runqueues that do not already
contain SCHED_DEADLINE tasks.

This patch fixes the issue by checking if dl.dl_nr_running == 0.

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 9d51426242 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")
Link: http://lkml.kernel.org/r/1444982781-15608-1-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:13:36 +02:00
Andrea Parri ff277d4250 sched/deadline: Fix comment in enqueue_task_dl()
The "dl_boosted" flag is set by comparing *absolute* deadlines
(c.f., rt_mutex_setprio()).

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438782979-9057-2-git-send-email-parri.andrea@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Andrea Parri 4ffa08ed4c sched/deadline: Fix comment in push_dl_tasks()
The comment is "misleading"; fix it by adapting a comment from
push_rt_tasks().

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438782979-9057-1-git-send-email-parri.andrea@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Peter Zijlstra 6c37067e27 sched: Change the sched_class::set_cpus_allowed() calling context
Change the calling context of sched_class::set_cpus_allowed() such
that we can assume the task is inactive.

This allows us to easily make changes that affect accounting done by
enqueue/dequeue. This does in fact completely remove
set_cpus_allowed_rt() and greatly reduces set_cpus_allowed_dl().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.667516139@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Peter Zijlstra c5b2803840 sched: Make sched_class::set_cpus_allowed() unconditional
Give every class a set_cpus_allowed() method, this enables some small
optimization in the RT,DL implementation by avoiding a double
cpumask_weight() call.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Xunlei Pang 3fe33bcdd3 sched/deadline: Remove a redundant condition from task_woken_dl()
'p' has been already queued at this point, so "!task_running(rq, p)"
and "p->nr_cpus_allowed > 1" imply that "has_pushable_dl_tasks(rq)"
is true, so it can be removed.

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435995563-3723-2-git-send-email-xlpang@126.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:20 +02:00
Linus Torvalds 98ec21a018 Merge branch 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
 "This series of scheduler updates depends on sched/core and timers/core
  branches, which are already in your tree:

   - Scheduler balancing overhaul to plug a hard to trigger race which
     causes an oops in the balancer (Peter Zijlstra)

   - Lockdep updates which are related to the balancing updates (Peter
     Zijlstra)"

* 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched,lockdep: Employ lock pinning
  lockdep: Implement lock pinning
  lockdep: Simplify lock_release()
  sched: Streamline the task migration locking a little
  sched: Move code around
  sched,dl: Fix sched class hopping CBS hole
  sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
  sched,dl: Remove return value from pull_dl_task()
  sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
  sched,rt: Remove return value from pull_rt_task()
  sched: Allow balance callbacks for check_class_changed()
  sched: Use replace normalize_task() with __sched_setscheduler()
  sched: Replace post_schedule with a balance callback list
2015-06-24 15:09:40 -07:00
Linus Torvalds 43224b96af Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather largish update for everything time and timer related:

   - Cache footprint optimizations for both hrtimers and timer wheel

   - Lower the NOHZ impact on systems which have NOHZ or timer migration
     disabled at runtime.

   - Optimize run time overhead of hrtimer interrupt by making the clock
     offset updates smarter

   - hrtimer cleanups and removal of restrictions to tackle some
     problems in sched/perf

   - Some more leap second tweaks

   - Another round of changes addressing the 2038 problem

   - First step to change the internals of clock event devices by
     introducing the necessary infrastructure

   - Allow constant folding for usecs/msecs_to_jiffies()

   - The usual pile of clockevent/clocksource driver updates

  The hrtimer changes contain updates to sched, perf and x86 as they
  depend on them plus changes all over the tree to cleanup API changes
  and redundant code, which got copied all over the place.  The y2038
  changes touch s390 to remove the last non 2038 safe code related to
  boot/persistant clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  clocksource: Increase dependencies of timer-stm32 to limit build wreckage
  timer: Minimize nohz off overhead
  timer: Reduce timer migration overhead if disabled
  timer: Stats: Simplify the flags handling
  timer: Replace timer base by a cpu index
  timer: Use hlist for the timer wheel hash buckets
  timer: Remove FIFO "guarantee"
  timers: Sanitize catchup_timer_jiffies() usage
  hrtimer: Allow hrtimer::function() to free the timer
  seqcount: Introduce raw_write_seqcount_barrier()
  seqcount: Rename write_seqcount_barrier()
  hrtimer: Fix hrtimer_is_queued() hole
  hrtimer: Remove HRTIMER_STATE_MIGRATE
  selftest: Timers: Avoid signal deadlock in leap-a-day
  timekeeping: Copy the shadow-timekeeper over the real timekeeper last
  clockevents: Check state instead of mode in suspend/resume path
  selftests: timers: Add leap-second timer edge testing to leap-a-day.c
  ntp: Do leapsecond adjustment in adjtimex read path
  time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
  ntp: Introduce and use SECS_PER_DAY macro instead of 86400
  ...
2015-06-22 18:57:44 -07:00
Zhiqiang Zhang 6fab541019 sched/deadline: Remove needless parameter in dl_runtime_exceeded()
Sine commit 269ad8015a ("sched/deadline: Avoid double-accounting in
case of missed deadlines), parameter 'rq' is no longer used, so
remove it.

Signed-off-by: Zhiqiang Zhang <zhangzhiqiang.zhang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <juri.lelli@gmail.com>
Cc: <luca.abeni@unitn.it>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434338120-43773-1-git-send-email-zhangzhiqiang.zhang@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19 10:06:48 +02:00
Wanpeng Li 9d51426242 sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
This patch adds a check that prevents futile attempts to move DL tasks
to a CPU with active tasks of equal or earlier deadline. The same
behavior as commit 80e3d87b2c ("sched/rt: Reduce rq lock contention
by eliminating locking of non-feasible target") for rt class.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431496867-4194-3-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19 10:06:46 +02:00
Wanpeng Li a6c0e746fb sched/deadline: Make init_sched_dl_class() __init
It's a bootstrap function, make init_sched_dl_class() __init.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431496867-4194-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19 10:06:46 +02:00
Wanpeng Li 8b5e770ed7 sched/deadline: Optimize pull_dl_task()
pull_dl_task() uses pick_next_earliest_dl_task() to select a migration
candidate; this is sub-optimal since the next earliest task -- as per
the regular runqueue -- might not be migratable at all. This could
result in iterating the entire runqueue looking for a task.

Instead iterate the pushable queue -- this queue only contains tasks
that have at least 2 cpus set in their cpus_allowed mask.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431496867-4194-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-19 10:06:45 +02:00
Peter Zijlstra cbce1a6867 sched,lockdep: Employ lock pinning
Employ the new lockdep lock pinning annotation to ensure no
'accidental' lock-breaks happen with rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:27 +02:00
Peter Zijlstra a649f237db sched,dl: Fix sched class hopping CBS hole
We still have a few pending issues with the deadline code, one of which
is that switching between scheduling classes can 'leak' CBS state.

Close the hole by retaining the current CBS state when leaving
SCHED_DEADLINE and unconditionally programming the deadline timer.
The timer will then reset the CBS state if the task is still
!SCHED_DEADLINE by the time it hits.

If the task left SCHED_DEADLINE it will not call task_dead_dl() and
we'll not cancel the hrtimer, leaving us a pending timer in free
space. Avoid this by giving the timer a task reference, this avoids
littering the task exit path for this rather uncommon case.

In order to do this, I had to move dl_task_offline_migration() below
the replenishment, such that the task_rq()->lock fully covers that.
While doing this, I noticed that it (was) buggy in assuming a task is
enqueued and or we need to enqueue the task now. Fixing this means
select_task_rq_dl() might encounter an offline rq -- look into that.

As a result this kills cancel_dl_timer() which included a rq->lock
break.

Fixes: 40767b0dc7 ("sched/deadline: Fix deadline parameter modification handling")
Cc: Wanpeng Li <wanpeng.li@linux.intel.com>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124743.574192138@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Peter Zijlstra 9916e21499 sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
Remove the direct {push,pull} balancing operations from
switched_{from,to}_dl() / prio_changed_dl() and use the balance
callback queue.

Again, err on the side of too many reschedules; since too few is a
hard bug while too many is just annoying.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.968262663@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Peter Zijlstra 0ea60c2054 sched,dl: Remove return value from pull_dl_task()
In order to be able to use pull_dl_task() from a callback, we need to
do away with the return value.

Since the return value indicates if we should reschedule, do this
inside the function. Since not all callers currently do this, this can
increase the number of reschedules due rt balancing.

Too many reschedules is not a correctness issues, too few are.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.859398977@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Peter Zijlstra e3fca9e7cb sched: Replace post_schedule with a balance callback list
Generalize the post_schedule() stuff into a balance callback list.
This allows us to more easily use it outside of schedule() and cross
sched_class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Thomas Gleixner 624bbdfac9 Merge branch 'timers/core' into sched/hrtimers
Merge sched/core and timers/core so we can apply the sched balancing
patch queue, which depends on both.
2015-06-19 00:17:47 +02:00
Jason Low 316c1608d1 sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:11:32 +02:00
Thomas Gleixner cc9684d3c1 sched: deadline: Use hrtimer_start()
hrtimer_start() does not longer defer already expired timers to the
softirq. Get rid of the __hrtimer_start_range_ns() invocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.627353666@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Wanpeng Li fa9c9d10e9 sched/deadline: Support DL task migration during CPU hotplug
I observed that DL tasks can't be migrated to other CPUs during CPU
hotplug, in addition, task may/may not be running again if CPU is
added back.

The root cause which I found is that DL tasks will be throtted and
removed from the DL rq after comsuming all their budget, which
leads to the situation that stop task can't pick them up from the
DL rq and migrate them to other CPUs during hotplug.

The method to reproduce:

  schedtool -E -t 50000:100000 -e ./test

Actually './test' is just a simple for loop. Then observe which CPU the
test task is on and offline it:

  echo 0 > /sys/devices/system/cpu/cpuN/online

This patch adds the DL task migration during CPU hotplug by finding a
most suitable later deadline rq after DL timer fires if current rq is
offline.

If it fails to find a suitable later deadline rq then it falls back to
any eligible online CPU in so that the deadline task will come back
to us, and the push/pull mechanism should then move it around properly.

Suggested-and-Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:57 +02:00
Juri Lelli 4cd57f9713 sched/deadline: Always enqueue on previous rq when dl_task_timer() fires
dl_task_timer() may fire on a different rq from where a task was removed
after throttling. Since the call path is:

  dl_task_timer() ->
    enqueue_task_dl() ->
      enqueue_dl_entity() ->
        replenish_dl_entity()

and replenish_dl_entity() uses dl_se's rq, we can't use current's rq
in dl_task_timer(), but we need to lock the task's previous one.

Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Fixes: 3960c8c0c7 ("sched: Make dl_task_time() use task_rq_lock()")
Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:56 +02:00
Abel Vesa 07c54f7a7f sched/core: Remove unused argument from init_[rt|dl]_rq()
Obviously, 'rq' is not used in these two functions, therefore,
there is no reason for it to be passed as an argument.

Signed-off-by: Abel Vesa <abelvesa@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:55 +02:00
Wanpeng Li bd4bde14b9 sched/deadline: Avoid a superfluous check
Since commit 40767b0dc7 ("sched/deadline: Fix deadline parameter
modification handling") we clear the thottled state when switching
from a dl task, therefore we should never find it set in switching to
a dl task.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:12 +01:00
Wanpeng Li 44fb085bfa sched/deadline: Add rq->clock update skip for dl task yield
This patch adds rq->clock update skip for SCHED_DEADLINE task yield,
to tell update_rq_clock() that we've just updated the clock, so that
we don't do a microscopic update in schedule() and double the
fastpath cost.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425961200-3809-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-10 05:46:50 +01:00
Kirill Tkhai 6f1607f1bd sched/dl: Do update_rq_clock() in yield_task_dl()
update_curr_dl() needs actual rq clock.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1423040972.18770.10.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18 16:17:12 +01:00
Kirill Tkhai a79ec89fd8 sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()
A deadline task may be throttled and dequeued at the same time.
This happens, when it becomes throttled in schedule(), which
is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
    deactivate_task()
        dequeue_task_dl()
            update_curr_dl()
                start_dl_timer()
            __dequeue_task_dl()
    prev->on_rq = 0;

Later the timer fires, but the task is still dequeued:

dl_task_timer()
    enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

Someone wakes it up:

try_to_wake_up()

    enqueue_dl_entity()
        BUG_ON(on_dl_rq())

Patch fixes this problem, it prevents queueing !on_rq tasks
on dl_rq.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Wrote comment. ]
Cc: Juri Lelli <juri.lelli@arm.com>
Fixes: 1019a359d3 ("sched/deadline: Fix stale yield state")
Link: http://lkml.kernel.org/r/1374601424090314@web4j.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18 14:27:31 +01:00
Peter Zijlstra 3960c8c0c7 sched: Make dl_task_time() use task_rq_lock()
Kirill reported that a dl task can be throttled and dequeued at the
same time. This happens, when it becomes throttled in schedule(),
which is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
    deactivate_task()
        dequeue_task_dl()
            update_curr_dl()
                start_dl_timer()
            __dequeue_task_dl()
    prev->on_rq = 0;

This invalidates the assumption from commit 0f397f2c90 ("sched/dl:
Fix race in dl_task_timer()"):

  "The only reason we don't strictly need ->pi_lock now is because
   we're guaranteed to have p->state == TASK_RUNNING here and are
   thus free of ttwu races".

And therefore we have to use the full task_rq_lock() here.

This further amends the fact that we forgot to update the rq lock loop
for TASK_ON_RQ_MIGRATE, from commit cca26e8009 ("sched: Teach
scheduler to understand TASK_ON_RQ_MIGRATING state").

Reported-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-18 14:27:30 +01:00
Peter Zijlstra 1019a359d3 sched/deadline: Fix stale yield state
When we fail to start the deadline timer in update_curr_dl(), we
forget to clear ->dl_yielded, resulting in wrecked time keeping.

Since the natural place to clear both ->dl_yielded and ->dl_throttled
is in replenish_dl_entity(); both are after all waiting for that event;
make it so.

Luckily since 67dfa1b756 ("sched/deadline: Implement
cancel_dl_timer() to use in switched_from_dl()") the
task_on_rq_queued() condition in dl_task_timer() must be true, and can
therefore call enqueue_task_dl() unconditionally.

Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-4-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 07:52:26 +01:00
Wanpeng Li a7bebf4887 sched/deadline: Fix hrtick for a non-leftmost task
After update_curr_dl() the current task might not be the leftmost task
anymore. In that case do not start a new hrtick for it.

In this case NEED_RESCHED will be set and the next schedule will start
the hrtick for the new task if and when appropriate.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Acked-by: Juri Lelli <juri.lelli@arm.com>
[ Rewrote the changelog and comment. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 07:52:25 +01:00
Ingo Molnar 4c195c8a19 Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 07:44:00 +01:00
Peter Zijlstra 40767b0dc7 sched/deadline: Fix deadline parameter modification handling
Commit 67dfa1b756 ("sched/deadline: Implement cancel_dl_timer() to
use in switched_from_dl()") removed the hrtimer_try_cancel() function
call out from init_dl_task_timer(), which gets called from
__setparam_dl().

The result is that we can now re-init the timer while its active --
this is bad and corrupts timer state.

Furthermore; changing the parameters of an active deadline task is
tricky in that you want to maintain guarantees, while immediately
effective change would allow one to circumvent the CBS guarantees --
this too is bad, as one (bad) task should not be able to affect the
others.

Rework things to avoid both problems. We only need to initialize the
timer once, so move that to __sched_fork() for new tasks.

Then make sure __setparam_dl() doesn't affect the current running
state but only updates the parameters used to calculate the next
scheduling period -- this guarantees the CBS functions as expected
(albeit slightly pessimistic).

This however means we need to make sure __dl_clear_params() needs to
reset the active state otherwise new (and tasks flipping between
classes) will not properly (re)compute their first instance.

Todo: close class flipping CBS hole.
Todo: implement delayed BW release.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Fixes: 67dfa1b756 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 07:42:48 +01:00
Xunlei Pang 16b269436b sched/deadline: Modify cpudl::free_cpus to reflect rd->online
Currently, cpudl::free_cpus contains all CPUs during init, see
cpudl_init(). When calling cpudl_find(), we have to add rd->span
to avoid selecting the cpu outside the current root domain, because
cpus_allowed cannot be depended on when performing clustered
scheduling using the cpuset, see find_later_rq().

This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
so we can avoid the rd->span operation when calling cpudl_find()
in find_later_rq().

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421642980-10045-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:39:16 +01:00
Luca Abeni 269ad8015a sched/deadline: Avoid double-accounting in case of missed deadlines
The dl_runtime_exceeded() function is supposed to ckeck if
a SCHED_DEADLINE task must be throttled, by checking if its
current runtime is <= 0. However, it also checks if the
scheduling deadline has been missed (the current time is
larger than the current scheduling deadline), further
decreasing the runtime if this happens.
This "double accounting" is wrong:

- In case of partitioned scheduling (or single CPU), this
  happens if task_tick_dl() has been called later than expected
  (due to small HZ values). In this case, the current runtime is
  also negative, and replenish_dl_entity() can take care of the
  deadline miss by recharging the current runtime to a value smaller
  than dl_runtime

- In case of global scheduling on multiple CPUs, scheduling
  deadlines can be missed even if the task did not consume more
  runtime than expected, hence penalizing the task is wrong

This patch fix this problem by throttling a SCHED_DEADLINE task
only when its runtime becomes negative, and not modifying the runtime

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09 11:18:57 +01:00
Luca Abeni 6a503c3be9 sched/deadline: Fix migration of SCHED_DEADLINE tasks
According to global EDF, tasks should be migrated between runqueues
without checking if their scheduling deadlines and runtimes are valid.
However, SCHED_DEADLINE currently performs such a check:
a migration happens doing:

	deactivate_task(rq, next_task, 0);
	set_task_cpu(next_task, later_rq->cpu);
	activate_task(later_rq, next_task, 0);

which ends up calling dequeue_task_dl(), setting the new CPU, and then
calling enqueue_task_dl().

enqueue_task_dl() then calls enqueue_dl_entity(), which calls
update_dl_entity(), which can modify scheduling deadline and runtime,
breaking global EDF scheduling.

As a result, some of the properties of global EDF are not respected:
for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
two cores can have unbounded response times for the third task even
if 30/80+40/80+120/170 = 1.5809 < 2

This can be fixed by invoking update_dl_entity() only in case of
wakeup, or if this is a new SCHED_DEADLINE task.

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09 11:18:56 +01:00
Wanpeng Li 36ce98818a sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with
the fair class.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:59:03 +01:00
Wanpeng Li c51b8ab5ad sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
Do not call dequeue_pushable_dl_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.

Actually the patch is the same behavior as commit 311e800e16 ("sched,
rt: Fix rq->rt.pushable_tasks bug in push_rt_task()" in -rt side.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:57 +01:00
Wanpeng Li 6c1d9410f0 sched: Move p->nr_cpus_allowed check to select_task_rq()
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:55 +01:00
Ingo Molnar e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00
Stanislaw Gruszka 6e998916df sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
Commit d670ec1317 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.

Reproducer/tester can be found further below, it can be compiled and ran by:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <stdio.h>
	#include <time.h>
	#include <pthread.h>
	#include <stdint.h>
	#include <inttypes.h>

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&barrier, NULL, 2);

		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&sleeptimeabs, &after);
		if (sleepdiff < 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&before, &after);
		if (diffabs < sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:20 +01:00
Wanpeng Li cad3bb32e1 sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
There are both UP and SMP version of pull_dl_task(), so don't need
to check CONFIG_SMP in switched_from_dl();

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:56 +01:00
Wanpeng Li cd66091162 sched/deadline: Reschedule from switched_from_dl() after a successful pull
In switched_from_dl() we have to issue a resched if we successfully
pulled some task from other cpus. This patch also aligns the behavior
with -rt.

Suggested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:55 +01:00
Wanpeng Li 6b0a563f3a sched/deadline: Push task away if the deadline is equal to curr during wakeup
This patch pushes task away if the dealine of the task is equal
to current during wake up. The same behavior as rt class.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:55 +01:00
Wanpeng Li acb32132ec sched/deadline: Add deadline rq status print
This patch add deadline rq status print.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:54 +01:00
Wanpeng Li 804968809c sched/deadline: Fix artificial overrun introduced by yield_task_dl()
The yield semantic of deadline class is to reduce remaining runtime to
zero, and then update_curr_dl() will stop it. However, comsumed bandwidth
is reduced from the budget of yield task again even if it has already been
set to zero which leads to artificial overrun. This patch fix it by make
sure we don't steal some more time from the task that yielded in update_curr_dl().

Suggested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:53 +01:00
Kirill Tkhai 67dfa1b756 sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()
Currently used hrtimer_try_to_cancel() is racy:

raw_spin_lock(&rq->lock)
...                            dl_task_timer                 raw_spin_lock(&rq->lock)
...                               raw_spin_lock(&rq->lock)   ...
   switched_from_dl()             ...                        ...
      hrtimer_try_to_cancel()     ...                        ...
   switched_to_fair()             ...                        ...
...                               ...                        ...
...                               ...                        ...
raw_spin_unlock(&rq->lock)        ...                        (asquired)
...                               ...                        ...
...                               ...                        ...
do_exit()                         ...                        ...
   schedule()                     ...                        ...
      raw_spin_lock(&rq->lock)    ...                        raw_spin_unlock(&rq->lock)
      ...                         ...                        ...
      raw_spin_unlock(&rq->lock)  ...                        raw_spin_lock(&rq->lock)
      ...                         ...                        (asquired)
      put_task_struct()           ...                        ...
          free_task_struct()      ...                        ...
      ...                         ...                        raw_spin_unlock(&rq->lock)
...                               (asquired)                 ...
...                               ...                        ...
...                               (use after free)           ...

So, let's implement 100% guaranteed way to cancel the timer and let's
be sure we are safe even in very unlikely situations.

rq unlocking does not limit the area of switched_from_dl() use, because
this has already been possible in pull_dl_task() below.

Let's consider the safety of of this unlocking. New code in the patch
is working when hrtimer_try_to_cancel() fails. This means the callback
is running. In this case hrtimer_cancel() is just waiting till the
callback is finished. Two

1) Since we are in switched_from_dl(), new class is not dl_sched_class and
new prio is not less MAX_DL_PRIO. So, the callback returns early; it's
right after !dl_task() check. After that hrtimer_cancel() returns back too.

The above is:

raw_spin_lock(rq->lock);                  ...
...                                       dl_task_timer()
...                                          raw_spin_lock(rq->lock);
   switched_from_dl()                        ...
       hrtimer_try_to_cancel()               ...
          raw_spin_unlock(rq->lock);         ...
          hrtimer_cancel()                   ...
          ...                                raw_spin_unlock(rq->lock);
          ...                                return HRTIMER_NORESTART;
          ...                             ...
          raw_spin_lock(rq->lock);        ...

2) But the below is also possible:
                                   dl_task_timer()
                                      raw_spin_lock(rq->lock);
                                      ...
                                      raw_spin_unlock(rq->lock);
raw_spin_lock(rq->lock);              ...
   switched_from_dl()                 ...
       hrtimer_try_to_cancel()        ...
       ...                            return HRTIMER_NORESTART;
       raw_spin_unlock(rq->lock);  ...
       hrtimer_cancel();           ...
       raw_spin_lock(rq->lock);    ...

In this case hrtimer_cancel() returns immediately. Very unlikely case,
just to mention.

Nobody can manipulate the task, because check_class_changed() is
always called with pi_lock locked. Nobody can force the task to
participate in (concurrent) priority inheritance schemes (the same reason).

All concurrent task operations require pi_lock, which is held by us.
No deadlocks with dl_task_timer() are possible, because it returns
right after !dl_task() check (it does nothing).

If we receive a new dl_task during the time of unlocked rq, we just
don't have to do pull_dl_task() in switched_from_dl() further.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
[ Added comments]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:50 +01:00
Wanpeng Li f4e9d94a5b sched/deadline: Don't balance during wakeup if wakee is pinned
Use nr_cpus_allowed to bail from select_task_rq() when only one cpu
can be used, and saves some cycles for pinned tasks.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:48:02 +01:00
Wanpeng Li 1d7e974cbf sched/deadline: Don't check SD_BALANCE_FORK
There is no need to do balance during fork since SCHED_DEADLINE
tasks can't fork. This patch avoid the SD_BALANCE_FORK check.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:48:01 +01:00
Juri Lelli 7f51412a41 sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:

 - No check is performed when the user tries to attach a task to
   an exlusive cpuset (recall that exclusive cpusets have an
   associated maximum allowed bandwidth).

 - Bandwidths of source and destination cpusets are not correctly
   updated after a task is migrated between them.

This patch fixes both things at once, as they are opposite faces
of the same coin.

The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.

Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:58 +01:00
Wanpeng Li d9aade7ae1 sched/deadline: Do not try to push tasks if pinned task switches to dl
As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118):

 | If rq has already had 2 or more pushable tasks and we try to add a
 | pinned task then call of push_rt_task will just waste a time.

Just switched pinned task is not able to be pushed. If the rq has had
several dl tasks before they have already been considered as candidates
to be pushed (or pulled). This patch implements the same behavior as rt
class which introduced by commit 1044791755 ("sched/rt: Do not try to
push tasks if pinned task switches to RT").

Suggested-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:57 +01:00
Kirill Tkhai f3a7e1a9c4 sched/dl: Fix preemption checks
1) switched_to_dl() check is wrong. We reschedule only
   if rq->curr is deadline task, and we do not reschedule
   if it's a lower priority task. But we must always
   preempt a task of other classes.

2) dl_task_timer():
   Policy does not change in case of priority inheritance.
   rt_mutex_setprio() changes prio, while policy remains old.

So we lose some balancing logic in dl_task_timer() and
switched_to_dl() when we check policy instead of priority. Boosted
task may be rq->curr.

(I didn't change switched_from_dl() because no check is necessary
there at all).

I've looked at this place(switched_to_dl) several times and even fixed
this function, but found just now...  I suppose some performance tests
may work better after this.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:10 +01:00
Juri Lelli aee38ea954 sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
dl_task_timer() is racy against several paths. Daniel noticed that
the replenishment timer may experience a race condition against an
enqueue_dl_entity() called from rt_mutex_setprio(). With his own
words:

 rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
 start_dl_timer() throttled = 1, rt_mutex_setprio() throlled = 0,
 sched_switch() -> enqueue_task(), dl_task_timer-> enqueue_task()
 throttled is 0

=> BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
enqueued on the -deadline runqueue.

As we do for the other races, we just bail out in the replenishment
timer code.

Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: vincent@legout.info
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:01 +01:00
Juri Lelli 64be6f1f5f sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
In the deboost path, right after the dl_boosted flag has been
reset, we can currently end up replenishing using -deadline
parameters of a !SCHED_DEADLINE entity. This of course causes
a bug, as those parameters are empty.

In the case depicted above it is safe to simply bail out, as
the deboosted task is going to be back to its original scheduling
class anyway.

Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: vincent@legout.info
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:46:00 +01:00
Linus Torvalds 0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Juri Lelli 91ec6778ec sched/deadline: Fix inter- exclusive cpusets migrations
Users can perform clustered scheduling using the cpuset facility.
After an exclusive cpuset is created, task migrations happen only
between CPUs belonging to the same cpuset. Inter- cpuset migrations
can only happen when the user requires so, moving a task between
different cpusets. This behaviour is broken in SCHED_DEADLINE, as
currently spurious inter- cpuset migration may happen without user
intervention.

This patch fix the problem (and shuffles the code a bit to improve
clarity).

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: raistlin@linux.it
Cc: michael@amarulasolutions.com
Cc: fchecconi@gmail.com
Cc: daniel.wagner@bmw-carit.de
Cc: vincent@legout.info
Cc: luca.abeni@unitn.it
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:57 +02:00
Juri Lelli a5e7be3b28 sched/deadline: Clear dl_entity params when setscheduling to different class
When a task is using SCHED_DEADLINE and the user setschedules it to a
different class its sched_dl_entity static parameters are not cleaned
up. This causes a bug if the user sets it back to SCHED_DEADLINE with
the same parameters again.  The problem resides in the check we
perform at the very beginning of dl_overflow():

	if (new_bw == p->dl.dl_bw)
		return 0;

This condition is met in the case depicted above, so the function
returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not
added to it). After this, admission control is broken.

This patch fixes the thing, properly clearing static parameters for a
task that ceases to use SCHED_DEADLINE.

Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:46:56 +02:00
Kirill Tkhai 1ba93d4272 sched/dl: Simplify pick_dl_task()
1) Nobody calls pick_dl_task() with negative cpu, it's old RT leftover.

2) If p->nr_cpus_allowed is 1, than the affinity has just been changed
  in set_cpus_allowed_ptr(); we'll pick it just earlier than migration
  thread.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529340.3569.27.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:20 +02:00
xiaofeng.yan 177ef2a631 sched/deadline: Fix a precision problem in the microseconds range
An overrun could happen in function start_hrtick_dl()
when a task with SCHED_DEADLINE runs in the microseconds
range.

For example, if a task with SCHED_DEADLINE has the following parameters:

  Task  runtime  deadline  period
   P1   200us     500us    500us

The deadline and period from task P1 are less than 1ms.

In order to achieve microsecond precision, we need to enable HRTICK feature
by the next command:

  PC#echo "HRTICK" > /sys/kernel/debug/sched_features
  PC#trace-cmd record -e sched_switch &
  PC#./schedtool -E -t 200000:500000:500000 -e ./test

The binary test is in an endless while(1) loop here.
Some pieces of trace.dat are as follows:

  <idle>-0   157.603157: sched_switch: :R ==> 2481:4294967295: test
  test-2481  157.603203: sched_switch:  2481:R ==> 0:120: swapper/2
  <idle>-0   157.605657: sched_switch:  :R ==> 2481:4294967295: test
  test-2481  157.608183: sched_switch:  2481:R ==> 2483:120: trace-cmd
  trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test

We can get the runtime of P1 from the information above:

  runtime = 157.608183 - 157.605657
  runtime = 0.002526(2.526ms)

The correct runtime should be less than or equal to 200us at some point.

The problem is caused by a conditional judgment "delta > 10000"
in function start_hrtick_dl().

Because no hrtimer start up to control the rest of runtime
when the reset of runtime is less than 10us.

So the process will continue to run until tick-period is coming.

Move the code with the limit of the least time slice
from hrtick_start_fair() to hrtick_start() because the
EDF schedule class also needs this function in start_hrtick_dl().

To fix this problem, we call hrtimer_start() unconditionally in
start_hrtick_dl(), and make sure the scheduling slice won't be smaller
than 10us in hrtimer_start().

Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com>
Reviewed-by: Li Zefan <lizefan@huawei.com>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
[ Massaged the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-07 11:09:59 +02:00
Christoph Lameter 4ba2968420 percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
__get_cpu_var can paper over differences in the definitions of
cpumask_var_t and either use the address of the cpumask variable
directly or perform a fetch of the address of the struct cpumask
allocated elsewhere. This is important particularly when using per cpu
cpumask_var_t declarations because in one case we have an offset into
a per cpu area to handle and in the other case we need to fetch a
pointer from the offset.

This patch introduces a new macro

this_cpu_cpumask_var_ptr()

that is defined where cpumask_var_t is defined and performs the proper
actions. All use cases where __get_cpu_var is used with cpumask_var_t
are converted to the use of this_cpu_cpumask_var_ptr().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-28 08:58:57 -04:00
Kirill Tkhai da0c1e65b5 sched: Add wrapper for checking task_struct::on_rq
Implement task_on_rq_queued() and use it everywhere instead of
on_rq check. No functional changes.

The only exception is we do not use the wrapper in
check_for_tasks(), because it requires to export
task_on_rq_queued() in global header files. Next patch in series
would return it back, so we do not twist it from here to there.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-20 14:52:59 +02:00
xiaofeng.yan 1b09d29bc0 sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1404712744-16986-1-git-send-email-xiaofeng.yan@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:20 +02:00
Kirill Tkhai 8875125efe sched: Transform resched_task() into resched_curr()
We always use resched_task() with rq->curr argument.
It's not possible to reschedule any task but rq's current.

The patch introduces resched_curr(struct rq *) to
replace all of the repeating patterns. The main aim
is cleanup, but there is a little size profit too:

  (before)
	$ size kernel/sched/built-in.o
	   text	   data	    bss	    dec	    hex	filename
	155274	  16445	   7042	 178761	  2ba49	kernel/sched/built-in.o

	$ size vmlinux
	   text	   data	    bss	    dec	    hex	filename
	7411490	1178376	 991232	9581098	 92322a	vmlinux

  (after)
	$ size kernel/sched/built-in.o
	   text	   data	    bss	    dec	    hex	filename
	155130	  16445	   7042	 178617	  2b9b9	kernel/sched/built-in.o

	$ size vmlinux
	   text	   data	    bss	    dec	    hex	filename
	7411362	1178376	 991232	9580970	 9231aa	vmlinux

	I was choosing between resched_curr() and resched_rq(),
	and the first name looks better for me.

A little lie in Documentation/trace/ftrace.txt. I have not
actually collected the tracing again. With a hope the patch
won't make execution times much worse :)

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 13:38:19 +02:00
Linus Torvalds b2e09f633a Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more scheduler updates from Ingo Molnar:
 "Second round of scheduler changes:
   - try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
   - continued power scheduling cleanups and refactorings, from Nicolas
     Pitre
   - misc fixes and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Delete extraneous extern for to_ratio()
  sched/idle: Optimize try-to-wake-up IPI
  sched/idle: Simplify wake_up_idle_cpu()
  sched/idle: Clear polling before descheduling the idle thread
  sched, trace: Add a tracepoint for IPI-less remote wakeups
  cpuidle: Set polling in poll_idle
  sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
  sched: Rename capacity related flags
  sched: Final power vs. capacity cleanups
  sched: Remove remaining dubious usage of "power"
  sched: Let 'struct sched_group_power' care about CPU capacity
  sched/fair: Disambiguate existing/remaining "capacity" usage
  sched/fair: Change "has_capacity" to "has_free_capacity"
  sched/fair: Remove "power" from 'struct numa_stats'
  sched: Fix signedness bug in yield_to()
  sched/fair: Use time_after() in record_wakee()
  sched/balancing: Reduce the rate of needless idle load balancing
  sched/fair: Fix unlocked reads of some cfs_b->quota/period
2014-06-12 19:42:15 -07:00
Linus Torvalds 3f17ea6dea Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...
2014-06-08 11:31:16 -07:00
Paul Gortmaker f602d06327 sched/deadline: Delete extraneous extern for to_ratio()
There was a prototype for it added to kernel/sched/sched.h
at the same time the extern was added, so the extern in
the C file was never really ever needed.

See commit 332ac17ef5
("sched/deadline: Add bandwidth management for SCHED_DEADLINE
tasks") for details.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Link: http://lkml.kernel.org/r/1400013605-18717-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 14:22:24 +02:00
Kirill Tkhai 0f397f2c90 sched/dl: Fix race in dl_task_timer()
Throttled task is still on rq, and it may be moved to other cpu
if user is playing with sched_setaffinity(). Therefore, unlocked
task_rq() access makes the race.

Juri Lelli reports he got this race when dl_bandwidth_enabled()
was not set.

Other thing, pointed by Peter Zijlstra:

   "Now I suppose the problem can still actually happen when
    you change the root domain and trigger a effective affinity
    change that way".

To fix that we do the same as made in __task_rq_lock(). We do not
use __task_rq_lock() itself, because it has a useful lockdep check,
which is not correct in case of dl_task_timer(). We do not need
pi_lock locked here. This case is an exception (PeterZ):

   "The only reason we don't strictly need ->pi_lock now is because
    we're guaranteed to have p->state == TASK_RUNNING here and are
    thus free of ttwu races".

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3056991400578422@web14g.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-05 11:51:12 +02:00
John Stultz c224815dac printk: Add printk_deferred_once
Two of the three prink_deferred uses are really printk_once style
uses, so add a printk_deferred_once macro to simplify those call
sites.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
John Stultz aac74dc495 printk: rename printk_sched to printk_deferred
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.

This only changes the function name. No logic changes.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
xiaofeng.yan 4027d08085 sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1399605687-18094-1-git-send-email-xiaofeng.yan@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22 11:16:37 +02:00
Kirill Tkhai 7246544786 sched, nohz: Change rq->nr_running to always use wrappers
Sometimes ->nr_running may cross 2 but interrupt is not being
sent to rq's cpu. In this case we don't reenable the timer.
Looks like this may be the reason for rare unexpected effects,
if nohz is enabled.

Patch replaces all places of direct changing of nr_running
and makes add_nr_running() caring about crossing border.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22 11:16:33 +02:00
Juri Lelli 5bfd126e80 sched/deadline: Fix sched_yield() behavior
yield_task_dl() is broken:

 o it forces current to be throttled setting its runtime to zero;
 o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
   will queue it back with proper parameters at replenish time.

Unfortunately, dl_task_timer() has this check at the very beginning:

	if (!dl_task(p) || dl_se->dl_new)
		goto unlock;

So, it just bails out and the task is never replenished. It actually
yielded forever.

To fix this, introduce a new flag indicating that the task properly yielded
the CPU before its current runtime expired. While this is a little overdoing
at the moment, the flag would be useful in the future to discriminate between
"good" jobs (of which remaining runtime could be reclaimed, i.e. recycled)
and "bad" jobs (for which dl_throttled task has been set) that needed to be
stopped.

Reported-by: yjay.kim <yjay.kim@lge.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07 11:51:31 +02:00
Kirill Tkhai a1d9a3231e sched: Check for stop task appearance when balancing happens
We need to do it like we do for the other higher priority classes..

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/336561397137116@web27h.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17 13:39:51 +02:00
Kirill Tkhai 734ff2a71f sched/rt: Fix picking RT and DL tasks from empty queue
The problems:

1) We check for rt_nr_running before call of put_prev_task().
   If previous task is RT, its rt_rq may become throttled
   and dequeued after this call.

In case of p is from rt->rq this just causes picking a task
from throttled queue, but in case of its rt_rq is child
we are guaranteed catch BUG_ON.

2) The same with deadline class. The only difference we operate
   on only dl_rq.

This patch fixes all the above problems and it adds a small skip in the
DL update like we've already done for RT class:

	if (unlikely((s64)delta_exec <= 0))
		return;

This will optimize sequential update_curr_dl() calls a little.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 12:05:35 +01:00
Ingo Molnar a02ed5e3e0 Merge branch 'sched/urgent' into sched/core
Pick up fixes before queueing up new changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:34:27 +01:00
Juri Lelli faa5993736 sched/deadline: Prevent rt_time growth to infinity
Kirill Tkhai noted:

  Since deadline tasks share rt bandwidth, we must care about
  bandwidth timer set. Otherwise rt_time may grow up to infinity
  in update_curr_dl(), if there are no other available RT tasks
  on top level bandwidth.

RT task were in fact throttled right after they got enqueued,
and never executed again (rt_time never again went below rt_runtime).

Peter then proposed to accrue DL execution on rt_time only when
rt timer is active, and proposed a patch (this patch is a slight
modification of that) to implement that behavior. While this
solves Kirill problem, it has a drawback.

Indeed, Kirill noted again:

  It looks we may get into a situation, when all CPU time is shared
  between RT and DL tasks:

  rt_runtime = n
  rt_period  = 2n

  | RT working, DL sleeping  | DL working, RT sleeping      |
  -----------------------------------------------------------
  | (1)     duration = n     | (2)     duration = n         | (repeat)
  |--------------------------|------------------------------|
  | (rt_bw timer is running) | (rt_bw timer is not running) |

  No time for fair tasks at all.

While this can happen during the first period, if rq is always backlogged,
RT tasks won't have the opportunity to execute anymore: rt_time reached
rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
throttled since rt timer didn't fire, replenishment is from now on eaten up
by DL tasks that accrue their execution on rt_time (while rt timer is
active - we have an RT task waiting for replenishment). FAIR tasks are
not touched after this first period. Ok, this is not ideal, and the situation
is even worse!

What above (the nice case), practically never happens in reality, where
your rt timer is not aligned to tasks periods, tasks are in general not
periodic, etc.. Long story short, you always risk to overload your system.

This patch is based on Peter's idea, but exploits an additional fact:
if you don't have RT tasks enqueued, it makes little sense to continue
incrementing rt_time once you reached the upper limit (DL tasks have their
own mechanism for throttling).

This cures both problems:

 - no matter how many DL instances in the past, you'll have an rt_time
   slightly above rt_runtime when an RT task is enqueued, and from that
   point on (after the first replenishment), the task will normally execute;

 - you can still eat up all bandwidth during the first period, but not
   anymore after that, remember that DL execution will increment rt_time
   till the upper limit is reached.

The situation is still not perfect! But, we have a simple solution for now,
that limits how much you can jeopardize your system, as we keep working
towards the right answer: RT groups scheduled using deadline servers.

Reported-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:29:41 +01:00
Kirill Tkhai 3908ac13b5 sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration
In deadline class we do not have group scheduling.

So, let's remove unnecessary

	X = X;

equations.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-27 12:29:39 +01:00
Peter Zijlstra dc87734106 sched: Remove some #ifdeffery
Remove a few gratuitous #ifdefs in pick_next_task*().

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21 21:43:18 +01:00