On Sparc and S390 the removal of irq.h from kernel_stat.h causes:
kernel/softirq.c:774:9: error: 'NR_IRQS_LEGACY' undeclared
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If online_css() fails, we should remove cgroup files belonging
to css->ss.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull scheduler fixes from Ingo Molnar:
"Three small fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/clock: Prevent tracing recursion in sched_clock_cpu()
stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus()
sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policy
The flag is necessary for interrupt chips which require an ACK/EOI
after the handler has run. In case of threaded handlers this needs to
happen after the threaded handler has completed before the unmask of
the interrupt.
The flag is only unseful in combination with the handle_fasteoi_irq
flow control handler.
It can be combined with the flag IRQCHIP_EOI_IF_HANDLED, so the EOI is
not issued when the interrupt is disabled or in progress.
Tested-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-sunxi@googlegroups.com
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Link: http://lkml.kernel.org/r/1394733834-26839-2-git-send-email-hdegoede@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This was a debugging measure to toggle enabled/disabled
when testing. But for real production setups, it's not
safe to toggle this setting without either reloading
drivers of quiescing IO first. Neither of which the toggle
enforces.
Additionally, it makes drivers deal with the conditional
state.
Remove it completely. It's up to the driver whether iopoll
is enabled or not.
Signed-off-by: Jens Axboe <axboe@fb.com>
When update_rq_clock_task() accounts the pending steal time for a task,
it converts the steal delta from nsecs to tick then from tick to nsecs.
There is no apparent good reason for doing that though because both
the task clock and the prev steal delta are u64 and store values
in nsecs.
So lets remove the needless conversion.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The steal guest time accounting code assumes that cputime_t is based on
jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
is based on nsecs, steal_account_process_tick() passes the delta in
jiffies to account_steal_time() which then accounts it as if it's a
value in nsecs.
As a result, accounting 1 second of steal time (with HZ=100 that would
be 100 jiffies) is spuriously accounted as 100 nsecs.
As such /proc/stat may report 0 values of steal time even when two
guests have run concurrently for a few seconds on the same host and
same CPU.
In order to fix this, lets convert the nsecs based steal delta to
cputime instead of jiffies by using the right conversion API.
Given that the steal time is stored in cputime_t and this type can have
a smaller granularity than nsecs, we only account the rounded converted
value and leave the remaining nsecs for the next deltas.
Reported-by: Huiqingding <huding@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Merge the request/release callbacks which are in a separate branch for
consumption by the gpio folks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For certain irq types, e.g. gpios, it's necessary to request resources
before starting up the irq.
This might fail so we cannot use the irq_startup() callback because we
might call the irq_set_type() callback before that which does not make
sense when the resource is not available. Calling irq_startup() before
irq_set_type() can lead to spurious interrupts which is not desired
either.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jean-Jacques Hiblot <jjhiblot@traphandler.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1403080857160.18573@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
OK, so commit:
1d8fe7dc80 ("locking/mutexes: Unlock the mutex without the wait_lock")
generates this boot warning when CONFIG_DEBUG_MUTEXES=y:
WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.c:82 debug_mutex_unlock+0x155/0x180() DEBUG_LOCKS_WARN_ON(lock->owner != current)
And that makes sense, because as soon as we release the lock a
new owner can come in...
One would think that !__mutex_slowpath_needs_to_unlock()
implementations suffer the same, but for DEBUG we fall back to
mutex-null.h which has an unconditional 1 for that.
The mutex debug code requires the mutex to be unlocked after
doing the debug checks, otherwise it can find inconsistent
state.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jason.low2@hp.com
Link: http://lkml.kernel.org/r/20140312122442.GB27965@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The tmp value has been already calculated in:
scaled_busy_load_per_task =
(busiest->load_per_task * SCHED_POWER_SCALE) /
busiest->group_power;
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394555166-22894-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I decided to run my tests on linux-next, and my wakeup_rt tracer was
broken. After running a bisect, I found that the problem commit was:
linux-next commit c365c292d0
"sched: Consider pi boosting in setscheduler()"
And the reason the wake_rt tracer test was failing, was because it had
no RT task to trace. I first noticed this when running with
sched_switch event and saw that my RT task still had normal SCHED_OTHER
priority. Looking at the problem commit, I found:
- p->normal_prio = normal_prio(p);
- p->prio = rt_mutex_getprio(p);
With no
+ p->normal_prio = normal_prio(p);
+ p->prio = rt_mutex_getprio(p);
Reading what the commit is suppose to do, I realize that the p->prio
can't be set if the task is boosted with a higher prio, but the
p->normal_prio still needs to be set regardless, otherwise, when the
task is deboosted, it wont get the new priority.
The p->prio has to be set before "check_class_changed()" is called,
otherwise the class wont be changed.
Also added fix to newprio to include a check for deadline policy that
was missing. This change was suggested by Juri Lelli.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: SebastianAndrzej Siewior <bigeasy@linutronix.de>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix descriptions of /sys/power/state in the documentation and in
a code comment.
Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
[rjw: Changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Spelling fix.
Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull audit namespace fixes from Eric Biederman:
"Starting with 3.14-rc1 the audit code is faulty (think oopses and
races) with respect to how it computes the network namespace of which
socket to reply to, and I happened to notice by chance when reading
through the code.
My testing and the automated build bots don't find any problems with
these fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
audit: Update kdoc for audit_send_reply and audit_list_rules_send
audit: Send replies in the proper network namespace.
audit: Use struct net not pid_t to remember the network namespce to reply in
Add in an extra reschedule in an attempt to avoid getting reschedule
the moment we've acquired the lock.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-zah5eyn9gu7qlgwh9r6n2anc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The mutex->spin_mlock was introduced in order to ensure that only 1 thread
spins for lock acquisition at a time to reduce cache line contention. When
lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
continually release and obtain the lock->spin_mlock. This can generate
quite a bit of overhead/contention, and also might just delay the spinner
from getting the lock.
This patch modifies the way optimistic spinners are queued by queuing before
entering the optimistic spinning loop as oppose to acquiring before every
call to mutex_spin_on_owner(). So in situations where the spinner requires
a few extra spins before obtaining the lock, then there will only be 1 spinner
trying to get the lock and it will avoid the overhead from unnecessarily
unlocking and locking the spin_mlock.
Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: tglx@linutronix.de
Cc: riel@redhat.com
Cc: akpm@linux-foundation.org
Cc: davidlohr@hp.com
Cc: hpa@zytor.com
Cc: andi@firstfloor.org
Cc: aswin@hp.com
Cc: scott.norton@hp.com
Cc: chegu_vinod@hp.com
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390936396-3962-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The mcs_spinlock code is not meant (or suitable) as a generic locking
primitive, therefore take it away from the normal includes and place
it in kernel/locking/.
This way the locking primitives implemented there can use it as part
of their implementation but we do not risk it getting used
inapropriately.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-byirmpamgr7h25m5kyavwpzx@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Check for fair tasks number to decide, that we've pulled a task.
rq's nr_running may contain throttled RT tasks.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394118975.19290.104.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1) Single cpu machine case.
When rq has only RT tasks, but no one of them can be picked
because of throttling, we enter in endless loop.
pick_next_task_{dl,rt} return NULL.
In pick_next_task_fair() we permanently go to retry
if (rq->nr_running != rq->cfs.h_nr_running)
return RETRY_TASK;
(rq->nr_running is not being decremented when rt_rq becomes
throttled).
No chances to unthrottle any rt_rq or to wake fair here,
because of rq is locked permanently and interrupts are
disabled.
2) In case of SMP this can cause a hang too. Although we unlock
rq in idle_balance(), interrupts are still disabled.
The solution is to check for available tasks in DL and RT
classes instead of checking for sum.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394098321.19290.11.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We close idle_exit_fair() bracket in case of we've pulled something or we've received
task of high priority class.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1394098315.19290.10.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The problems:
1) We check for rt_nr_running before call of put_prev_task().
If previous task is RT, its rt_rq may become throttled
and dequeued after this call.
In case of p is from rt->rq this just causes picking a task
from throttled queue, but in case of its rt_rq is child
we are guaranteed catch BUG_ON.
2) The same with deadline class. The only difference we operate
on only dl_rq.
This patch fixes all the above problems and it adds a small skip in the
DL update like we've already done for RT class:
if (unlikely((s64)delta_exec <= 0))
return;
This will optimize sequential update_curr_dl() calls a little.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.
The user space stack dump is just another source of the this issue.
Related list discussions:
http://marc.info/?t=139302086500001&r=1&w=2http://marc.info/?t=139301437300003&r=1&w=2
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.
Related list discussions:
http://marc.info/?t=139302086500001&r=1&w=2http://marc.info/?t=139301437300003&r=1&w=2
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have the main cpuidle function in idle.c, move some code from
the idle mainloop to this function for the sake of clarity.
That removes if then else indentation difficult to follow when looking at the
code. This patch does not change the current behavior.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-3-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cpuidle_idle_call does nothing more than calling the three individuals
function and is no longer used by any arch specific code but only in the
cpuidle framework code.
We can move this function into the idle task code to ensure better
proximity to the scheduler code.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prevent tracing of preempt_disable/enable() in sched_clock_cpu().
When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are
traced and this causes trace_clock() users (and probably others) to
go into an infinite recursion. Systems with a stable sched_clock()
are not affected.
This problem is similar to that fixed by upstream commit 95ef1e5292
("KVM guest: prevent tracing recursion with kvmclock").
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We must use smp_call_function_single(.wait=1) for the
irq_cpu_stop_queue_work() to ensure the queueing is actually done under
stop_cpus_lock. Without this we could have dropped the lock by the time
we do the queueing and get the race we tried to fix.
Fixes: 7053ea1a34 ("stop_machine: Fix race between stop_two_cpus() and stop_cpus()")
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140228123905.GK3104@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Deny the use of SCHED_DEADLINE policy to unprivileged users.
Even if root users can set the policy for normal users, we
don't want the latter to be able to change their parameters
(safest behavior).
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Double ! or !! are normally required to get 0 or 1 out of a expression. A
comparision always returns 0 or 1 and hence there is no need to apply double !
over it again.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
GFP_THISNODE is for callers that implement their own clever fallback to
remote nodes. It restricts the allocation to the specified node and
does not invoke reclaim, assuming that the caller will take care of it
when the fallback fails, e.g. through a subsequent allocation request
without GFP_THISNODE set.
However, many current GFP_THISNODE users only want the node exclusive
aspect of the flag, without actually implementing their own fallback or
triggering reclaim if necessary. This results in things like page
migration failing prematurely even when there is easily reclaimable
memory available, unless kswapd happens to be running already or a
concurrent allocation attempt triggers the necessary reclaim.
Convert all callsites that don't implement their own fallback strategy
to __GFP_THISNODE. This restricts the allocation a single node too, but
at the same time allows the allocator to enter the slowpath, wake
kswapd, and invoke direct reclaim if necessary, to make the allocation
happen when memory is full.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kbuild test robot reported:
> tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next
> head: 6f285b19d0
> commit: 6f285b19d0 [2/2] audit: Send replies in the proper network namespace.
> reproduce: make htmldocs
>
> >> Warning(kernel/audit.c:575): No description found for parameter 'request_skb'
> >> Warning(kernel/audit.c:575): Excess function parameter 'portid' description in 'audit_send_reply'
> >> Warning(kernel/auditfilter.c:1074): No description found for parameter 'request_skb'
> >> Warning(kernel/auditfilter.c:1074): Excess function parameter 'portid' description in 'audit_list_rules_s
Which was caused by my failure to update the kdoc annotations when I
updated the functions. Fix that small oversight now.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull cgroup fixes from Tejun Heo:
"Two cpuset locking fixes from Li. Both tagged for -stable"
* 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: fix a race condition in __cpuset_node_allowed_softwall()
cpuset: fix a locking issue in cpuset_migrate_mm()
Developers would say they put a trace_printk() before and after the trace
event but when they enable it (and the trace event said it was enabled) they
would see the trace_printks but not the trace event.
I was not able to reproduce this, but that's because I wasn't looking at
the right location. Recently, another bug came up that showed the issue.
If your kernel supports signed modules but allows for non-signed modules
to be loaded, then when one is, the kernel will silently set the
MODULE_FORCED taint on the module. Although, this taint happens without
the need for insmod --force or anything of the kind, it labels the
module with that taint anyway.
If this tainted module has tracepoints, the tracepoints will be ignored
because of the MODULE_FORCED taint. But no error message will be
displayed. Worse yet, the event infrastructure will still be created
letting users enable the trace event represented by the tracepoint,
although that event will never actually be enabled. This is because
the tracepoint infrastructure allows for non-existing tracepoints to
be enabled for new modules to arrive and have their tracepoints set.
Although there are several things wrong with the above, this change
only addresses the creation of the trace event files for tracepoints
that are not created when a module is loaded and is tainted. This change
will print an error message about the module being tainted and not the
trace events will not be created, and it does not create the trace event
infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTFnMPAAoJEKQekfcNnQGuPPwH/Rtwy/siM+ltvlLnEbRjS4RL
9aF5mfJUazmfCaOBMSaMUo92uCbciIVif6icX843JmCdCOR5Hk5SZryBbt2A/dF9
TcMloKNbIn/ad7yZ0O75BJlPnRJ5RZ42edQfW1lkdeWo644C8Kj399fVPt7KU5SH
1KTWyShT05E2fYjp2lMrb+FOFfKerlzkXtgGwJKXnd/7hrbdmKEH/OO8YkMrlVZp
SURPyzNMMVKoUFY797b6FrFRqV04C210BtNcNrd4S3/V9VE4IPS/8YSLfvVaGkD0
e2kVAvIOkwPnYzMZg70jf2R8NlGS2mwaVC+NenBHz3KlpFdaeRu1hFw7/n8h2/s=
=YbJd
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"In the past, I've had lots of reports about trace events not working.
Developers would say they put a trace_printk() before and after the
trace event but when they enable it (and the trace event said it was
enabled) they would see the trace_printks but not the trace event.
I was not able to reproduce this, but that's because I wasn't looking
at the right location. Recently, another bug came up that showed the
issue.
If your kernel supports signed modules but allows for non-signed
modules to be loaded, then when one is, the kernel will silently set
the MODULE_FORCED taint on the module. Although, this taint happens
without the need for insmod --force or anything of the kind, it labels
the module with that taint anyway.
If this tainted module has tracepoints, the tracepoints will be
ignored because of the MODULE_FORCED taint. But no error message will
be displayed. Worse yet, the event infrastructure will still be
created letting users enable the trace event represented by the
tracepoint, although that event will never actually be enabled. This
is because the tracepoint infrastructure allows for non-existing
tracepoints to be enabled for new modules to arrive and have their
tracepoints set.
Although there are several things wrong with the above, this change
only addresses the creation of the trace event files for tracepoints
that are not created when a module is loaded and is tainted. This
change will print an error message about the module being tainted and
not the trace events will not be created, and it does not create the
trace event infrastructure"
* tag 'trace-fixes-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Do not add event files for modules that fail tracepoints
Pull irq fixes from Thomas Gleixner:
- a bugfix for a long standing waitqueue race
- a trivial fix for a missing include
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Include missing header file in irqdomain.c
genirq: Remove racy waitqueue_active check
In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that
performs proper zero and sign extension convert all 64 bit parameters
to their corresponding 32 bit compat counterparts.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that
performs proper zero and sign extension convert all 64 bit parameters
to their corresponding 32 bit compat counterparts.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Convert all compat system call functions where all parameter types
have a size of four or less than four bytes, or are pointer types
to COMPAT_SYSCALL_DEFINE.
The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper
zero and sign extension to 64 bit of all parameters if needed.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:
C R 232 + 240 [0]
C R 240 + 232 [0]
C R 248 + 224 [0]
C R 256 + 216 [0]
but should be:
C R 232 + 8 [0]
C R 240 + 8 [0]
C R 248 + 8 [0]
C R 256 + 8 [0]
Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.
This patch takes into account real completion size of the request and
fixes wrong completion accounting.
Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
No more users outside the core code. Put it into the poison
cabinet. That also gets rid of the linux/irq.h include in
kernel_stat.h
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140223212739.124207133@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is a common pattern all over the place:
kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
This results in a call to core code anyway. So provide a function
which does the same thing in core.
While at it, replace the butt ugly macro with an inline.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140223212737.422068876@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently we are using two lowest bit of base for internal purpose and
so they both should be zero in the allocated address. The code was
doing the right thing before this patch came in: commit c5f66e99b
(timer: Implement TIMER_IRQSAFE)
Tejun probably forgot to update this piece of code which checks if the
lowest 'n' bits are zero or not and so wasn't updated according to the
new flag. Lets use TIMER_FLAG_MASK in the calculations here.
[ tglx: Massaged changelog ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: tj@kernel.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/9144e10d7e854a0aa8a673332adec356d81a923c.1393576981.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If a module fails to add its tracepoints due to module tainting, do not
create the module event infrastructure in the debugfs directory. As the events
will not work and worse yet, they will silently fail, making the user wonder
why the events they enable do not display anything.
Having a warning on module load and the events not visible to the users
will make the cause of the problem much clearer.
Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org
Fixes: 6d723736e4 "tracing/events: add support for modules to TRACE_EVENT"
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org # 2.6.31+
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes, most of them SCHED_DEADLINE fallout"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Prevent rt_time growth to infinity
sched/deadline: Switch CPU's presence test order
sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration
sched: Fix double normalization of vruntime
If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().
This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Include appropriate header file kernel/time/timekeeping_internal.h in
kernel/time/timekeeping_debug.c because it has prototype declaration of
function defined in kernel/time/timekeeping_debug.c.
This eliminates the following warning in
kernel/time/timekeeping_debug.c:
kernel/time/timekeeping_debug.c:68:6: warning: no previous prototype for ‘tk_debug_account_sleep_time’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull perf fixes from Ingo Molnar:
"Misc fixes, most of them on the tooling side"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Fix strict alias issue for find_first_bit
perf tools: fix BFD detection on opensuse
perf: Fix hotplug splat
perf/x86: Fix event scheduling
perf symbols: Destroy unused symsrcs
perf annotate: Check availability of annotate when processing samples
In perverse cases of file descriptor passing the current network
namespace of a process and the network namespace of a socket used by
that socket may differ. Therefore use the network namespace of the
appropiate socket to ensure replies always go to the appropiate
socket.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use the name_to_dev_t call to parse the device name echo'd to
to /sys/power/resume. This imitates the method used in hibernate.c
in software_resume, and allows the resume partition to be specified
using other equivalent device formats as well. By allowing
/sys/debug/resume to accept the same syntax as the resume=device
parameter, we can parse the resume=device in the init script and
use the resume device directly from the kernel command line.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Include appropriate header file kernel/power/power.h in
kernel/power/wakelock.c because it has prototype declaration of function
defined in kernel/power/wakelock.c.
This eliminates the following warning in kernel/power/wakelock.c:
kernel/power/wakelock.c:34:9: warning: no previous prototype for ‘pm_show_wakelocks’ [-Wmissing-prototypes]
kernel/power/wakelock.c:184:5: warning: no previous prototype for ‘pm_wake_lock’ [-Wmissing-prototypes]
kernel/power/wakelock.c:232:5: warning: no previous prototype for ‘pm_wake_unlock’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move prototype declaration of function to header file
kernel/power/power.h because it is used by more than one file.
This eliminates the following warning in kernel/power/snapshot.c:
kernel/power/snapshot.c:1588:16: warning: no previous prototype for ‘swsusp_save’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid heavy conflicts caused by WIP patches in drivers/cpuidle/cpuidle.c,
by merging these into a single base.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In struct audit_netlink_list and audit_reply add a reference to the
network namespace of the caller and remove the userspace pid of the
caller. This cleanly remembers the callers network namespace, and
removes a huge class of races and nasty failure modes that can occur
when attempting to relook up the callers network namespace from a
pid_t (including the caller's network namespace changing, pid
wraparound, and the pid simply not being present).
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull RCU updates from Paul E. McKenney:
* Update RCU documentation. These were posted to LKML at
https://lkml.org/lkml/2014/2/17/555.
* Miscellaneous fixes. These were posted to LKML at
https://lkml.org/lkml/2014/2/17/530. Note that two of these
are RCU changes to other maintainer's trees: add1f09954
(fs) and 8857563b81 (notifer), both of which substitute
rcu_access_pointer() for rcu_dereference_raw().
* Real-time latency fixes. These were posted to LKML at
https://lkml.org/lkml/2014/2/17/544.
* Torture-test changes, including refactoring of rcutorture
and introduction of a vestigial locktorture. These were posted
to LKML at https://lkml.org/lkml/2014/2/17/599.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark function as static in kernel/seccomp.c because it is not used
outside this file.
This eliminates the following warning in kernel/seccomp.c:
kernel/seccomp.c:296:6: warning: no previous prototype for ?seccomp_attach_user_filter? [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Pull filesystem fixes from Jan Kara:
"Notification, writeback, udf, quota fixes
The notification patches are (with one exception) a fallout of my
fsnotify rework which went into -rc1 (I've extented LTP to cover these
cornercases to avoid similar breakage in future).
The UDF patch is a nasty data corruption Al has recently reported,
the revert of the writeback patch is due to possibility of violating
sync(2) guarantees, and a quota bug can lead to corruption of quota
files in ocfs2"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: Allocate overflow events with proper type
fanotify: Handle overflow in case of permission events
fsnotify: Fix detection whether overflow event is queued
Revert "writeback: do not sync data dirtied after sync start"
quota: Fix race between dqput() and dquot_scan_active()
udf: Fix data corruption on file type conversion
inotify: Fix reporting of cookies for inotify events
I can trigger a lockdep warning:
# mount -t cgroup -o cpuset xxx /cgroup
# mkdir /cgroup/cpuset
# mkdir /cgroup/tmp
# echo 0 > /cgroup/tmp/cpuset.cpus
# echo 0 > /cgroup/tmp/cpuset.mems
# echo 1 > /cgroup/tmp/cpuset.memory_migrate
# echo $$ > /cgroup/tmp/tasks
# echo 1 > /cgruop/tmp/cpuset.mems
===============================
[ INFO: suspicious RCU usage. ]
3.14.0-rc1-0.1-default+ #32 Not tainted
-------------------------------
include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
...
[<ffffffff81582174>] dump_stack+0x72/0x86
[<ffffffff810b8f01>] lockdep_rcu_suspicious+0x101/0x140
[<ffffffff81105ba1>] cpuset_migrate_mm+0xb1/0xe0
...
We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
we hold cpuset_mutex, which causes task_css() to complain.
This is not a false-positive but a real issue.
Holding cpuset_mutex won't prevent a task from migrating to another
cpuset, and it won't prevent the original task->cgroup from destroying
during this change.
Fixes: 5d21cc2db0 (cpuset: replace cgroup_mutex locking with cpuset internal locking)
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Li Zefan <lizefan@huawei.com>
Sigend-off-by: Tejun Heo <tj@kernel.org>
Include appropriate header file include/linux/of_irq.h in
kernel/irq/irqdomain.c because it contains prototype definition of
function define in kernel/irq/irqdomain.c.
This eliminates the following warning in kernel/irq/irqdomain.c:
kernel/irq/irqdomain.c:468:14: warning: no previous prototype for ‘irq_create_of_mapping’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/eb89aebea7ff1a46122918ac389ebecf8248be9a.1393493276.git.rashika.kheria@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the ctx pmu instead of the event pmu.
When a group leader is a software event but the group contains
hardware events, the entire group is on the hardware PMU.
Using the hardware PMU for the transaction makes most sense since
that's the most expensive one to programm (and software PMUs generally
don't have TXN support anyway).
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-sctoo9t2f3nn2c9g568928q3@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently perf_branch_stack_sched_in iterates over the set of pmus,
checks that each pmu has a flush_branch_stack callback, then overwrites
the pmu before calling the callback. This is either redundant or broken.
In systems with a single hw pmu, pmu == cpuctx->ctx.pmu, and thus the
assignment is redundant.
In systems with multiple hw pmus (i.e. multiple pmus with task_ctx_nr ==
perf_hw_context) the pmus share the same perf_cpu_context. Thus the
assignment can cause one of the pmus to flush its branch stack
repeatedly rather than causing each of the pmus to flush their branch
stacks. Worse still, if only some pmus have the callback the assignment
can result in a branch to NULL.
This patch removes the redundant assignment.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1392054264-23570-3-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For some reason find_pmu_context() is defined as returning void * rather
than a __percpu struct perf_cpu_context *. As all the requisite types are
defined in advance there's no reason to keep it that way.
This patch modifies the prototype of pmu_find_context to return a
__percpu struct perf_cpu_context *.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1392054264-23570-2-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use MAX_NICE instead of the value 19 for ring_buffer_benchmark.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1393251121-25534-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Michael spotted that the idle_balance() push down created a task
priority problem.
Previously, when we called idle_balance() before pick_next_task() it
wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
task slipped in.
Similarly for pre_schedule(), rt pre-schedule could have a dl task
slip in.
But by pulling it into the pick_next_task() loop, we'll not try a
higher task priority again.
Cure this by creating a re-start condition in pick_next_task(); and
triggering this from pick_next_task_{rt,fair}().
It also fixes a live-lock where we get stuck in pick_next_task_fair()
due to idle_balance() seeing !0 nr_running but there not actually
being any fair tasks about.
Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Fixes: 38033c37fa ("sched: Push down pre_schedule() and idle_balance()")
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The struct sched_avg of struct rq is only used in case group
scheduling is enabled inside __update_tg_runnable_avg() to update
per-cpu representation of a task group. I.e. that there is no need to
maintain the runnable avg of a rq in the !CONFIG_FAIR_GROUP_SCHED case.
This patch guards struct sched_avg of struct rq and
update_rq_runnable_avg() with CONFIG_FAIR_GROUP_SCHED.
There is an extra empty definition for update_rq_runnable_avg()
necessary for the !CONFIG_FAIR_GROUP_SCHED && CONFIG_SMP case.
The function print_cfs_group_stats() which prints out struct sched_avg
of struct rq is already guarded with CONFIG_FAIR_GROUP_SCHED.
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/530DCDC5.1060406@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Drew Richardson reported that he could make the kernel go *boom* when hotplugging
while having perf events active.
It turned out that when you have a group event, the code in
__perf_event_exit_context() fails to remove the group siblings from
the context.
We then proceed with destroying and freeing the event, and when you
re-plug the CPU and try and add another event to that CPU, things go
*boom* because you've still got dead entries there.
Reported-by: Drew Richardson <drew.richardson@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/n/tip-k6v5wundvusvcseqj1si0oz0@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kirill Tkhai noted:
Since deadline tasks share rt bandwidth, we must care about
bandwidth timer set. Otherwise rt_time may grow up to infinity
in update_curr_dl(), if there are no other available RT tasks
on top level bandwidth.
RT task were in fact throttled right after they got enqueued,
and never executed again (rt_time never again went below rt_runtime).
Peter then proposed to accrue DL execution on rt_time only when
rt timer is active, and proposed a patch (this patch is a slight
modification of that) to implement that behavior. While this
solves Kirill problem, it has a drawback.
Indeed, Kirill noted again:
It looks we may get into a situation, when all CPU time is shared
between RT and DL tasks:
rt_runtime = n
rt_period = 2n
| RT working, DL sleeping | DL working, RT sleeping |
-----------------------------------------------------------
| (1) duration = n | (2) duration = n | (repeat)
|--------------------------|------------------------------|
| (rt_bw timer is running) | (rt_bw timer is not running) |
No time for fair tasks at all.
While this can happen during the first period, if rq is always backlogged,
RT tasks won't have the opportunity to execute anymore: rt_time reached
rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
throttled since rt timer didn't fire, replenishment is from now on eaten up
by DL tasks that accrue their execution on rt_time (while rt timer is
active - we have an RT task waiting for replenishment). FAIR tasks are
not touched after this first period. Ok, this is not ideal, and the situation
is even worse!
What above (the nice case), practically never happens in reality, where
your rt timer is not aligned to tasks periods, tasks are in general not
periodic, etc.. Long story short, you always risk to overload your system.
This patch is based on Peter's idea, but exploits an additional fact:
if you don't have RT tasks enqueued, it makes little sense to continue
incrementing rt_time once you reached the upper limit (DL tasks have their
own mechanism for throttling).
This cures both problems:
- no matter how many DL instances in the past, you'll have an rt_time
slightly above rt_runtime when an RT task is enqueued, and from that
point on (after the first replenishment), the task will normally execute;
- you can still eat up all bandwidth during the first period, but not
anymore after that, remember that DL execution will increment rt_time
till the upper limit is reached.
The situation is still not perfect! But, we have a simple solution for now,
that limits how much you can jeopardize your system, as we keep working
towards the right answer: RT groups scheduled using deadline servers.
Reported-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In deadline class we do not have group scheduling.
So, let's remove unnecessary
X = X;
equations.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarentee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.
In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.
This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We hit one rare case below:
T1 calling disable_irq(), but hanging at synchronize_irq()
always;
The corresponding irq thread is in sleeping state;
And all CPUs are in idle state;
After analysis, we found there is one possible scenerio which
causes T1 is waiting there forever:
CPU0 CPU1
synchronize_irq()
wait_event()
spin_lock()
atomic_dec_and_test(&threads_active)
insert the __wait into queue
spin_unlock()
if(waitqueue_active)
atomic_read(&threads_active)
wake_up()
Here after inserted the __wait into queue on CPU0, and before
test if queue is empty on CPU1, there is no barrier, it maybe
cause it is not visible for CPU1 immediately, although CPU0 has
updated the queue list.
It is similar for CPU0 atomic_read() threads_active also.
So we'd need one smp_mb() before waitqueue_active.that, but removing
the waitqueue_active() check solves it as wel l and it makes
things simple and clear.
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Cc: Xiaoming Wang <xiaoming.wang@intel.com>
Link: http://lkml.kernel.org/r/1393212590-32543-1-git-send-email-chuansheng.liu@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have two identical copies of resource_contains() already, and more
places that could use it. This moves it to ioport.h where it can be
shared.
resource_contains(struct resource *r1, struct resource *r2) returns true
iff r1 and r2 are the same type (most callers already checked this
separately) and the r1 address range completely contains r2.
In addition, the new resource_contains() checks that both r1 and r2 have
addresses assigned to them. If a resource is IORESOURCE_UNSET, it doesn't
have a valid address and can't contain or be contained by another resource.
Some callers already check this or for res->start.
No functional change.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The kbuild test bot uncovered an implicit dependence on the
trace header being present before rcu.h in ia64 allmodconfig
that looks like this:
In file included from kernel/ksysfs.c:22:0:
kernel/rcu/rcu.h: In function '__rcu_reclaim':
kernel/rcu/rcu.h:107:3: error: implicit declaration of function 'trace_rcu_invoke_kfree_callback' [-Werror=implicit-function-declaration]
kernel/rcu/rcu.h:112:3: error: implicit declaration of function 'trace_rcu_invoke_callback' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
Looking at other rcu.h users, we can find that they all
were sourcing the trace header in advance of rcu.h itself,
as seen in the context of this diff. There were also some
inconsistencies as to whether it was or wasn't sourced based
on the parent tracing Kconfig.
Rather than "fix" it at each use site, and have inconsistent
use based on whether "#ifdef CONFIG_RCU_TRACE" was used or not,
lets just source the trace header just once, in the actual consumer
of it, which is rcu.h itself. We include it unconditionally, as
build testing shows us that is a hard requirement for some files.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit fixes the follwoing warning:
kernel/ksysfs.c:143:5: warning: symbol 'rcu_expedited' was not declared. Should it be static?
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
[ paulmck: Moved the declaration to include/linux/rcupdate.h to avoid
including the RCU-internal rcu.h file outside of RCU. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
(Trivial patch.)
If the code is looking at the RCU-protected pointer itself, but not
dereferencing it, the rcu_dereference() functions can be downgraded
to rcu_access_pointer(). This commit makes this downgrade in
__blocking_notifier_call_chain() which simply compares the RCU-protected
pointer against NULL with no dereferencing.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The function kgdb_breakpoint() sets up break point at
compile time by calling arch_kgdb_breakpoint();
Though this call is surrounded by wmb() barrier,
the compile can still re-order the break point,
because this scheduling barrier is not a code motion
barrier in gcc.
Making kgdb_breakpoint() as noinline solves this problem
of code reording around break point instruction and also
avoids problem of being called as inline function from
other places
More details about discussion on this can be found here
http://comments.gmane.org/gmane.linux.ports.arm.kernel/269732
Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The internal_add_timer() function updates base->next_timer only if
timer->expires < base->next_timer. This is correct, but it also makes
sense to do the same if we add the first non-deferrable timer.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
The __run_timers() function currently steps through the list one jiffy at
a time in order to update the timer wheel. However, if the timer wheel
is empty, no adjustment is needed other than updating ->timer_jiffies.
Therefore, just before we add a timer to an empty timer wheel, we should
mark the timer wheel as being up to date. This marking will reduce (and
perhaps eliminate) the jiffy-stepping that a future __run_timers() call
will need to do in response to some future timer posting or migration.
This commit therefore updates ->timer_jiffies for this case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
The __run_timers() function currently steps through the list one jiffy at
a time in order to update the timer wheel. However, if the timer wheel
is empty, no adjustment is needed other than updating ->timer_jiffies.
Therefore, if we just emptied the timer wheel, for example, by deleting
the last timer, we should mark the timer wheel as being up to date.
This marking will reduce (and perhaps eliminate) the jiffy-stepping that
a future __run_timers() call will need to do in response to some future
timer posting or migration. This commit therefore catches ->timer_jiffies
for this case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
The __run_timers() function currently steps through the list one jiffy at
a time in order to update the timer wheel. However, if the timer wheel
is empty, no adjustment is needed other than updating ->timer_jiffies.
In this case, which is likely to be common for NO_HZ_FULL kernels, the
kernel currently incurs a large latency for no good reason. This commit
therefore short-circuits this case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
Currently, the tvec_base structure's ->active_timers field tracks only
the non-deferrable timers, which means that even if ->active_timers is
zero, there might well be deferrable timers in the list. This commit
therefore adds an ->all_timers field to track all the timers, whether
deferrable or not.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Mike Galbraith <bitbucket@online.de>
The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().
The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. This is actually a requirement that result from the
purpose of this function which is to raise an IPI asynchronously.
As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller who then needs to serialize the
IPIs on this csd.
Lets rename the function and enhance the comments so that they reflect
these properties.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.
This flexibility comes at the expense of the caller who then needs to
synchronize the csd lifecycle by himself and make sure that IPIs on a
single csd are serialized.
This is how __smp_call_function_single() works when wait = 0 and this
usecase is relevant.
Now there don't seem to be any usecase with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Lets look
at the two possible scenario:
1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
in an object. It looks like a nice and convenient pattern at the first
sight because we can then retrieve the object from the IPI handler easily.
But actually it is a waste of memory space in the object since the csd
can be allocated from the stack by smp_call_function_single(wait = 1)
and the object can be passed an the IPI argument.
Besides that, embedding the csd in an object is more error prone
because the caller must take care of the serialization of the IPIs
for this csd.
2) The user calls __smp_call_function_single(wait = 1) on a csd that
is allocated on the stack. It's ok but smp_call_function_single()
can do it as well and it already takes care of the allocation on the
stack. Again it's more simple and less error prone.
Therefore, using the underscore prepend API version with wait = 1
is a bad pattern and a sign that the caller can do safer and more
simple.
There was a single user of that which has just been converted.
So lets remove this option to discourage further users.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to remotely restart the watchdog hrtimer, update_timers()
allocates a csd on the stack and pass it to __smp_call_function_single().
There is no partcular need, however, for a specific csd here. Lets
simplify that a little by calling smp_call_function_single()
which can already take care of the csd allocation by itself.
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>