I noticed some CLOCK_TAI timer test failures on one of my
less-frequently used configurations. And after digging in I
found in 76f4108892 (Cleanup hrtimer accessors to the
timekeeping state), the hrtimer_get_softirq_time tai offset
calculation was incorrectly rewritten, as the tai offset we
return should be from CLOCK_MONOTONIC, and not CLOCK_REALTIME.
This results in CLOCK_TAI timers expiring early on non-highres
capable machines.
This patch fixes the issue, calculating the tai time properly
from the monotonic base.
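To illustrate the arithmetic, here is a minimal userspace sketch, not the
actual kernel code. It assumes all clock offsets are expressed relative to
CLOCK_MONOTONIC in nanoseconds; the struct and function names are made up
for the example.

```c
#include <stdint.h>

/* Assumption: all offsets are relative to CLOCK_MONOTONIC, in nanoseconds. */
struct clock_offsets {
	int64_t off_real;	/* monotonic -> realtime */
	int64_t off_tai;	/* monotonic -> TAI */
};

/* Correct: derive TAI from the monotonic base. */
static int64_t tai_from_mono(int64_t mono, const struct clock_offsets *o)
{
	return mono + o->off_tai;
}

/* Buggy variant: applies the TAI offset on top of realtime instead. */
static int64_t tai_from_real_buggy(int64_t mono, const struct clock_offsets *o)
{
	int64_t real = mono + o->off_real;

	return real + o->off_tai;
}
```

In this model the buggy result is ahead of the correct one by exactly
off_real, so the cached TAI time looks later than it really is.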
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable <stable@vger.kernel.org> # 3.17+
Link: http://lkml.kernel.org/r/1423097126-10236-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of
the config and the code more consistent.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently the adjustments made as part of perf_event_task_tick() use the
percpu rotation lists to iterate over any active PMU contexts, but these
are not used by the context rotation code, having been replaced by
separate (per-context) hrtimer callbacks. However, some manipulation of
the rotation lists (i.e. removal of contexts) has remained in
perf_rotate_context(). This leads to the following issues:
* Contexts are not always removed from the rotation lists. Removal of
PMUs which have been placed in rotation lists, but have not been
removed by a hrtimer callback can result in corruption of the rotation
lists (when memory backing the context is freed).
This has been observed to result in hangs when PMU drivers built as
modules are inserted and removed around the creation of events for
said PMUs.
* Contexts which do not require rotation may be removed from the
rotation lists as a result of a hrtimer, and will not be considered by
the unthrottling code in perf_event_task_tick.
This patch fixes the issue by updating the rotation list when events are
scheduled in/out, ensuring that each rotation list stays in sync with
the HW state. As each event holds a refcount on the module of its PMU,
this ensures that when a PMU module is unloaded none of its CPU contexts
can be in a rotation list. By maintaining a list of perf_event_contexts
rather than perf_event_cpu_contexts, we don't need separate paths to
handle the cpu and task contexts, which also makes the code a little
simpler.
As the rotation_list variables are not used for rotation, these are
renamed to active_ctx_list, which better matches their current function.
perf_pmu_rotate_{start,stop} are renamed to
perf_pmu_ctx_{activate,deactivate}.
Reported-by: Johannes Jensen <johannes.jensen@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <Will.Deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150129134511.GR17721@leverpostej
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When initialising an event, perf_init_event will call try_module_get() to
ensure that the PMU's module cannot be removed for the lifetime of the
event, with __free_event() dropping the reference when the event is
finally destroyed. If something fails after the event has been
initialised, but before the event is installed, perf_event_alloc will
drop the reference on the module.
However, if we fail to initialise an event for some reason (e.g. we ask
an uncore PMU to perform sampling, and it refuses to initialise the
event), we do not drop the refcount. If we try to open such a bogus
event without a precise IDR type, we will loop over each PMU in the pmus
list, incrementing each of their refcounts without decrementing them.
This patch adds a module_put when pmu->event_init(event) fails, ensuring
that the refcounts are balanced in failure cases. As the innards of the
precise and search based initialisation look very similar, this logic is
hoisted out into a new helper function. While the early return for the
failed try_module_get is removed from the search case, this is handled
by the remaining return when ret is not -ENOENT.
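The balancing rule can be sketched in userspace as follows. This is only an
illustration under stated assumptions: fake_module, fake_pmu and
try_init_event() are hypothetical stand-ins, not the kernel's types or the
helper added by this patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a module and a PMU, for illustration only. */
struct fake_module {
	int refcount;
};

struct fake_pmu {
	struct fake_module *module;
	int (*event_init)(void);	/* 0 on success, negative errno on failure */
};

static bool fake_module_get(struct fake_module *m)
{
	m->refcount++;
	return true;
}

static void fake_module_put(struct fake_module *m)
{
	m->refcount--;
}

/* Take the module reference, and drop it again if initialisation fails. */
static int try_init_event(struct fake_pmu *pmu)
{
	int ret;

	if (!fake_module_get(pmu->module))
		return -ENODEV;

	ret = pmu->event_init();
	if (ret)
		fake_module_put(pmu->module);

	return ret;
}

/* Two toy init callbacks: one accepts the event, one refuses it. */
static int init_ok(void) { return 0; }
static int init_refuses(void) { return -ENOENT; }
```

With this shape, looping over several PMUs that all refuse the event leaves
every refcount where it started.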
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420642611-22667-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we flag available data (via the poll syscall) on the perf fd with
the POLL_IN macro, which is normally used for the SIGIO interface.
We've been lucky, because POLLIN (0x1) is a subset of POLL_IN (0x20001)
and sys_poll (do_pollfd function) cuts the extra bit out (0x20000).
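A rough userspace model of why this went unnoticed, assuming the in-kernel
POLL_IN value quoted above and assuming the per-fd filtering keeps only the
requested events plus POLLERR/POLLHUP. KERNEL_POLL_IN and filter_revents()
are names invented for this sketch.

```c
#include <poll.h>

/* In-kernel value of POLL_IN as quoted in this changelog (assumed here). */
#define KERNEL_POLL_IN 0x20001

/*
 * Rough model of the per-fd filtering: only the requested events plus
 * POLLERR/POLLHUP survive, so the stray high bit is masked off.
 */
static short filter_revents(unsigned int mask, short events)
{
	return (short)(mask & (unsigned int)(events | POLLERR | POLLHUP));
}
```

Calling filter_revents(KERNEL_POLL_IN, POLLIN) yields plain POLLIN, which is
why callers still saw the expected result.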
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422467678-22341-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So what I suspect; but I'm in zombie mode today it seems; is that while
I initially thought that it was impossible for ctx to change when
refcount dropped to 0, I now suspect it's possible.
Note that until perf_remove_from_context() the event is still active and
visible on the lists. So a concurrent sys_perf_event_open() from another
task into this task can race.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: mark.rutland@arm.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150129134434.GB26304@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over
the weekend:
event_sched_out.isra.79+0x2b9/0x2d0
group_sched_out+0x69/0xc0
ctx_sched_out+0x106/0x130
task_ctx_sched_out+0x37/0x70
__perf_install_in_context+0x70/0x1a0
remote_function+0x48/0x60
generic_exec_single+0x15b/0x1d0
smp_call_function_single+0x67/0xa0
task_function_call+0x53/0x80
perf_install_in_context+0x8b/0x110
I think the below should cure this; if we install a group leader it
will iterate the (still intact) group list and find its siblings and
try and install those too -- even though those still have the old
event->ctx -- in the new ctx.
Upon installing the first group sibling we'd try and schedule out the
group and trigger the above warn.
Fix this by installing the group leader last, installing siblings
would have no effect, they're not reachable through the group lists
and therefore we don't schedule them.
Also delay resetting the state until we're absolutely sure the events
are quiescent.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Reported-by: vincent.weaver@maine.edu
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150126162639.GA21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There have been a few reported issues wrt. the lack of locking around
changing event->ctx. This patch tries to address those.
It avoids the whole rwsem thing; and while it appears to work, please
give it some thought in review.
What I did fail at is sensible runtime checks on the use of
event->ctx, the RCU use makes it very hard.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150123125834.209535886@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a few WARN()s to catch things that should never happen.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150123125834.150481799@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We explicitly mark the task running after returning from
a __rt_mutex_slowlock() call, which does the actual sleeping
via wait-wake-trylocking. As such, this patch does two things:
(1) refactors the code so that setting current to TASK_RUNNING
is done by __rt_mutex_slowlock(), and not by the callers. The
downside to this is that it becomes a bit unclear at what
point we block. As such I've added a comment that the task
blocks when calling __rt_mutex_slowlock() so readers can figure
out when it is running again.
(2) relaxes setting current's state through __set_current_state(),
instead of its more expensive barrier alternative. There was no
need for the implied barrier as we're obviously not planning on
blocking.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422857784.18096.1.camel@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Call __set_task_state() instead of assigning the new state
directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP
environments, keeping track of who last changed the state.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422257769-14083-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "thread would block" case can be checked without grabbing ->wait.lock.
[ If the check does not return early then grab the lock and recheck.
A memory barrier is not needed as complete() and complete_all() imply
a barrier.
The ACCESS_ONCE() is needed for calls in a loop that, if inlined, could
optimize out the re-fetching of x->done. ]
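The check-then-lock pattern described above can be sketched in userspace
like this. It is an illustration only: fake_completion and its toy spinlock
are assumptions standing in for struct completion and ->wait.lock, and a
volatile read stands in for ACCESS_ONCE().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified userspace stand-in for struct completion (illustration only). */
struct fake_completion {
	unsigned int done;
	atomic_flag lock;	/* toy spinlock standing in for ->wait.lock */
};

static void fake_lock(struct fake_completion *x)
{
	while (atomic_flag_test_and_set_explicit(&x->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void fake_unlock(struct fake_completion *x)
{
	atomic_flag_clear_explicit(&x->lock, memory_order_release);
}

static bool fake_completion_done(struct fake_completion *x)
{
	bool ret;

	/* The volatile read forces a fresh load of ->done on every call. */
	if (!*(volatile unsigned int *)&x->done)
		return false;	/* would block: no need to take the lock */

	fake_lock(x);
	ret = x->done != 0;	/* recheck under the lock */
	fake_unlock(x);

	return ret;
}
```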
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422013307-13200-1-git-send-email-der.herr@hofr.at
Signed-off-by: Ingo Molnar <mingo@kernel.org>
By the time we wake up and get the lock after being asleep
in the slowpath, we better be running. As good practice,
be explicit about this and avoid any mischief.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421717961.4903.11.camel@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The second 'mutex' shouldn't be there. It can't be about the mutex,
as a mutex can't be freed, only unlocked; the memory where the
mutex resides, however, can be freed.
Signed-off-by: Sharon Dvir <sharon.dvir1@mail.huji.ac.il>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422827252-31363-1-git-send-email-sharon.dvir1@mail.huji.ac.il
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__schedule() disables preemption during its job and re-enables it
afterward without doing a preemption check to avoid recursion.
But if an event happens after the context switch which requires
rescheduling, we need to check again if a task of a higher priority
needs the CPU. A preempt irq can raise such a situation. To handle that,
__schedule() loops on need_resched().
But preempt_schedule_*() functions, which call __schedule(), also loop
on need_resched() to handle missed preempt irqs. Hence we end up with
the same loop happening twice.
Let's simplify that by attributing the need_resched() loop responsibility
to all __schedule() callers.
There is a risk that the outer loop now handles reschedules that used
to be handled by the inner loop with the added overhead of caller details
(inc/dec of PREEMPT_ACTIVE, irq save/restore) but assuming those inner
rescheduling loops weren't too frequent, this shouldn't matter. Especially
since the whole preemption path is now losing one loop in any case.
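As a toy model of the resulting shape (not the scheduler code; every name
below is hypothetical), the scheduling step does a single pass and the
caller owns the need_resched() loop:

```c
#include <stdbool.h>

/* Hypothetical state for the sketch: pending reschedule requests. */
static int pending_resched;
static int context_switches;

static bool need_resched(void)
{
	return pending_resched > 0;
}

/* One scheduling pass, with no internal retry loop. */
static void schedule_once(void)
{
	pending_resched--;
	context_switches++;
}

/* The caller is responsible for looping while a reschedule is pending. */
static void schedule_caller(void)
{
	do {
		schedule_once();
	} while (need_resched());
}
```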
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1422404652-29067-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpu_active_mask is rarely changed (only on hotplug), so remove this
operation to gain a little performance.
If there is a change in cpu_active_mask, rq_online_dl() and
rq_offline_dl() should take care of it normally, so cpudl::free_cpus
carries enough information for us.
For the rare case when a task is put onto a dying cpu (which
rq_offline_dl() can't handle in a timely fashion), it will be
handled through _cpu_down()->...->multi_cpu_stop()->migration_call()
->migrate_tasks(), preventing the task from hanging on the
dead cpu.
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
[peterz: changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1421642980-10045-2-git-send-email-pang.xunlei@linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The commit 177ef2a631 ("sched/deadline: Fix a precision problem in
the microseconds range") forgot to change the UP version of
hrtick_start(), do so now.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 177ef2a631 ("sched/deadline: Fix a precision problem in the microseconds range")
[ Fixed the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to dequeue/enqueue and push/pull if there are
no scheduling parameters changed for the DL class.
Both fair and RT classes already check if parameters changed for
them to avoid unnecessary overhead. This patch adds the same
parameters-changed test for the DL class in order to reduce overhead.
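Such a test could look roughly like the sketch below. The struct and its
field names are simplified assumptions for illustration, not the kernel's
sched_dl_entity or sched_attr layouts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the deadline parameters (illustration only). */
struct dl_params {
	uint64_t runtime;
	uint64_t deadline;
	uint64_t period;
	uint64_t flags;
};

/* True if any requested parameter differs from what the task already has. */
static bool dl_params_differ(const struct dl_params *cur,
			     const struct dl_params *req)
{
	return cur->runtime != req->runtime ||
	       cur->deadline != req->deadline ||
	       cur->period != req->period ||
	       cur->flags != req->flags;
}
```

When this returns false the caller can skip the dequeue/enqueue and
push/pull work entirely.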
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Fixed up the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we fail to start the deadline timer in update_curr_dl(), we
forget to clear ->dl_yielded, resulting in wrecked time keeping.
Since the natural place to clear both ->dl_yielded and ->dl_throttled
is in replenish_dl_entity(); both are after all waiting for that event;
make it so.
Luckily since 67dfa1b756 ("sched/deadline: Implement
cancel_dl_timer() to use in switched_from_dl()") the
task_on_rq_queued() condition in dl_task_timer() must be true, and can
therefore call enqueue_task_dl() unconditionally.
Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-4-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After update_curr_dl() the current task might not be the leftmost task
anymore. In that case do not start a new hrtick for it.
In this case NEED_RESCHED will be set and the next schedule will start
the hrtick for the new task if and when appropriate.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Acked-by: Juri Lelli <juri.lelli@arm.com>
[ Rewrote the changelog and comment. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 67dfa1b756 ("sched/deadline: Implement cancel_dl_timer() to
use in switched_from_dl()") removed the hrtimer_try_cancel() function
call out from init_dl_task_timer(), which gets called from
__setparam_dl().
The result is that we can now re-init the timer while it's active --
this is bad and corrupts timer state.
Furthermore; changing the parameters of an active deadline task is
tricky in that you want to maintain guarantees, while immediately
effective change would allow one to circumvent the CBS guarantees --
this too is bad, as one (bad) task should not be able to affect the
others.
Rework things to avoid both problems. We only need to initialize the
timer once, so move that to __sched_fork() for new tasks.
Then make sure __setparam_dl() doesn't affect the current running
state but only updates the parameters used to calculate the next
scheduling period -- this guarantees the CBS functions as expected
(albeit slightly pessimistic).
This however means we need to make sure __dl_clear_params() resets
the active state, otherwise new tasks (and tasks flipping between
classes) will not properly (re)compute their first instance.
Todo: close class flipping CBS hole.
Todo: implement delayed BW release.
Reported-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Fixes: 67dfa1b756 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hibernate_preallocate_memory() prints out how many pages are
allocated, but it doesn't take into consideration the pages freed by
free_unnecessary_pages(). Therefore, it always shows a count higher
than what was actually allocated.
Signed-off-by: Wonhong Kwon <wonhong.kwon@lge.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The tracing "instances" directory can create sub tracing buffers
with mkdir, and remove them with rmdir. As a mkdir will also create
all the files and directories that control the sub buffer, the inode
mutexes need to be released before this is done, to avoid deadlocks.
It is better to let the tracing system unlock the inode mutexes before
calling the functions that create the files within the new directory
(or deletes the files from the one being destroyed).
Now that tracing has been converted over to tracefs, the tracefs file
system can be modified to accommodate this feature. It still releases
the locks, but the filesystem itself can take care of the ugly
business and let the user just do what it needs.
The tracing system now attaches a descriptor to the directory dentry
that can have userspace create or remove sub directories. If this
descriptor does not exist for a dentry, then that dentry can not be
used to create other directories. This descriptor holds a mkdir and
rmdir method that only takes a character string as an argument.
The tracefs file system will first make a copy of the dentry name
before releasing the locks. Then it will pass the copied name to the
methods. It is up to the tracing system that supplied the methods to
handle races with duplicate names and such, as all the inode mutexes
would be released when the functions are called.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As tools currently rely on the tracing directory in debugfs, we can not
just create a tracefs infrastructure and expect sysadmins to mount
the new tracefs to have their old tools work.
Instead, the debugfs tracing directory is still created and the tracefs
file system is mounted there when the debugfs filesystem is mounted.
No longer does the tracing infrastructure update the debugfs file system,
but instead interacts with the tracefs file system. But now, it still
appears to the user like nothing changed, except you also have the feature
of mounting just the tracing system without needing all of debugfs!
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
debugfs was fine for the tracing facility as a quick way to get
an interface. Now that tracing has matured, it should separate itself
from debugfs such that it can be mounted separately without needing
to mount all of debugfs with it. That is, users resist using tracing
because it requires mounting debugfs. Having tracing have its own file
system lets users get the features of tracing without needing to bring
in the rest of the kernel's debug infrastructure.
Another reason for tracefs is that debugfs does not support mkdir.
Currently, to create instances, one does a mkdir in the tracing/instances
directory. This is implemented via a hack that forces debugfs to do
something it is not intended on doing. By converting over to tracefs, this
hack can be removed and mkdir can be properly implemented. This patch does
not address this yet, but it lays the groundwork for that to be done.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The options for cmdline tracers are not created if the debugfs system
is not ready yet. If tracing has started before debugfs is up, then the
option files for the tracer are not created. Create them when creating
the tracing directory if the current tracer requires option files.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not bother creating tracer options if no tracing directory
exists. If a tracer is enabled via the command line, and is
started before the tracing directory is created, then it won't have
its tracer specific options created.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge tag 'v3.19-rc7' into x86/asm, to refresh the branch before pulling in new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull in Al Viro's changes to debugfs that implement the new primitive:
debugfs_create_automount(), that creates a directory in debugfs that will
safely mount another file system automatically when debugfs is mounted.
This will let tracefs automount itself on top of debugfs/tracing directory.
The top level trace array is treated a little differently than the
instances, as it has to deal with more of the general tracing.
The tr->dir is the tracing directory, which is an immutable
dentry, whereas the tr->dir of instances is the dentry that
was created, and can be destroyed later. These should have different
functions accessing them.
As only tracing_init_dentry() deals with the top level array, fold
the code for it into that function, and remove the trace_init_dentry_tr()
that was also used by the instances to get their directory dentry.
Add a tracing_get_dentry() to just get the tracing dir entry for
instances as well as the top level array.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
(struct perf_pmu_events_attr) is defined in include/linux/perf_event.h,
but the only "show" for it is in x86 and contains x86 specific stuff.
Make a generic one for those of us who are just using the event_str.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 8eb23b9f35 ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not to actually sleep when it
calls schedule.
However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).
And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.
In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont. But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.
This fixes both cases:
- don't set TASK_RUNNING when the nested condition happens (note: even
if WARN_ONCE() only _warns_ once, the return value isn't whether the
warning happened, but whether the condition for the warning was true.
So despite the warning only happening once, the "if (WARN_ON(..))"
would trigger for every nested sleep).
- in the cases where we knowingly disable the warning by using
"sched_annotate_sleep()", don't change the task state (that is used
for all core scheduling decisions), instead use '->task_state_change'
that is used for the debugging decision itself.
(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)
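The return-value point can be shown with a toy warn-once helper. This is a
userspace sketch with made-up names, not the kernel's WARN_ONCE() macro; it
only mirrors the property described above, namely that the message is
printed once while the condition is returned every time.

```c
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed;

/*
 * Toy model of a warn-once helper: the message is printed a single time,
 * but the return value always reflects the condition.
 */
static bool warn_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Counts how often the "if (warn_once(...))" body runs for n nested sleeps. */
static int count_triggers(int n)
{
	int hits = 0;

	for (int i = 0; i < n; i++)
		if (warn_once(true, "nested sleep"))
			hits++;
	return hits;
}
```

Any side effect placed inside the "if" body therefore runs on every nested
sleep, not just the first one.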
Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also an event groups fix, two PMU driver
fixes and a CPU model variant addition"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Tighten (and fix) the grouping condition
perf/x86/intel: Add model number for Airmont
perf/rapl: Fix crash in rapl_scale()
perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
perf probe: Fix probing kretprobes
perf symbols: Introduce 'for' method to iterate over the symbols with a given name
perf probe: Do not rely on map__load() filter to find symbols
perf symbols: Introduce method to iterate symbols ordered by name
perf symbols: Return the first entry with a given name in find_by_name method
perf annotate: Fix memory leaks in LOCK handling
perf annotate: Handle ins parsing failures
perf scripting perl: Force to use stdbool
perf evlist: Remove extraneous 'was' on error message
Currently, cpudl::free_cpus contains all CPUs during init, see
cpudl_init(). When calling cpudl_find(), we have to add rd->span
to avoid selecting the cpu outside the current root domain, because
cpus_allowed cannot be depended on when performing clustered
scheduling using the cpuset, see find_later_rq().
This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
so we can avoid the rd->span operation when calling cpudl_find()
in find_later_rq().
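A minimal sketch of the idea, using a 64-bit word as a toy CPU mask. The
fake_* names are hypothetical, and the model only covers the
online/offline bookkeeping described here, not the rest of the cpudl logic.

```c
#include <stdint.h>

/* Toy CPU mask: bit n set means CPU n is free (up to 64 CPUs, sketch only). */
struct fake_cpudl {
	uint64_t free_cpus;
};

/* Called when a runqueue comes online in this model. */
static void fake_set_freecpu(struct fake_cpudl *cp, int cpu)
{
	cp->free_cpus |= UINT64_C(1) << cpu;
}

/* Called when a runqueue goes offline in this model. */
static void fake_clear_freecpu(struct fake_cpudl *cp, int cpu)
{
	cp->free_cpus &= ~(UINT64_C(1) << cpu);
}

/* Candidates are free CPUs the task may run on; no extra span mask needed. */
static uint64_t fake_find_candidates(const struct fake_cpudl *cp,
				     uint64_t cpus_allowed)
{
	return cp->free_cpus & cpus_allowed;
}
```

Because free_cpus starts empty and only gains CPUs that came online in this
domain, the lookup no longer has to intersect with a span mask.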
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421642980-10045-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpu_idle_poll() is entered into when either the cpu_idle_force_poll is set or
tick_check_broadcast_expired() returns true. The exit condition from
cpu_idle_poll() is tif_need_resched().
However this does not take into account scenarios where cpu_idle_force_poll
changes or tick_check_broadcast_expired() returns false, without setting
the resched flag. So a cpu will be caught in cpu_idle_poll() needlessly,
thereby wasting power. Add an explicit check on cpu_idle_force_poll and
tick_check_broadcast_expired() to the exit condition of cpu_idle_poll()
to avoid this.
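The extended exit condition can be modelled in userspace as below. The flags
and the poll_step hook are assumptions made for the sketch; they stand in
for kernel state that may change while the CPU is polling.

```c
#include <stdbool.h>

/* Hypothetical flags standing in for kernel state, for illustration only. */
static bool need_resched_flag;
static bool force_poll;
static bool broadcast_expired;

/* Optional hook run on each iteration; models state changing under us. */
static void (*poll_step)(void);

/* Polls until a resched is needed or the reason for polling goes away. */
static int idle_poll(void)
{
	int spins = 0;

	while (!need_resched_flag && (force_poll || broadcast_expired)) {
		spins++;
		if (poll_step)
			poll_step();
	}
	return spins;
}

/* Example step: the force-poll request is withdrawn without a resched. */
static void clear_force_poll(void)
{
	force_poll = false;
}
```

With only the need_resched check in the loop condition, the example step
above would leave the loop spinning forever.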
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150121105655.15279.59626.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If an interrupt fires in cond_resched(), between the call to __schedule()
and the PREEMPT_ACTIVE count decrementation, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored
due to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption
being delayed because it's interrupting a preempt-disabled area, is
usually fixed up after preemption is re-enabled back with an explicit
call to preempt_schedule().
This is what preempt_enable() does but a raw preempt count decrement as
performed by __preempt_count_sub(PREEMPT_ACTIVE) doesn't handle delayed
preemption check. Therefore when such a race happens, the rescheduling
is going to be delayed until the next scheduler or preemption entrypoint.
This can be a problem for scheduler latency sensitive workloads.
Let's fix that by consolidating cond_resched() with preempt_schedule()
internals.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@kernel.org>
Original-patch-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1421946484-9298-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds checks that prevent futile attempts to move rt tasks
to a CPU with active tasks of equal or higher priority.
This reduces run queue lock contention and improves the performance of
a well known OLTP benchmark by 0.7%.
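The kind of check involved can be sketched as follows. This is a simplified
illustration with a made-up function name; it assumes, as with kernel RT
priorities, that a lower numeric value means a higher priority.

```c
#include <stdbool.h>

/*
 * Assumption for this sketch: lower numeric value means higher priority.
 * Returns false when the target CPU already runs a task of equal or
 * higher priority, in which case moving the task there is futile.
 */
static bool worth_pushing(int task_prio, int target_highest_prio)
{
	if (target_highest_prio <= task_prio)
		return false;
	return true;
}
```

Rejecting such targets early avoids taking the remote run queue lock for a
move that could not succeed.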
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Cc: Suruchi Kadu <suruchi.a.kadu@intel.com>
Cc: Doug Nelson <doug.nelson@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export the suspend_resume tracepoint so it can be used
in loadable modules.
Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the kernel is compiled with function tracer support the -pg compile option
is passed to gcc to generate extra code into the prologue of each function.
This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
makefile variable which architectures can override if a different option
should be used for code generation.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The ring_buffer_producer uses 'struct timeval' to measure
its start and end times. 'struct timeval' on 32-bit systems
will have its tv_sec value overflow in year 2038 and beyond.
This patch replaces struct timeval with 'ktime_t' which uses
64-bit representation for nanoseconds.
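To put numbers on the overflow, here is a small user-space sketch. The helper names are made up for illustration; only the arithmetic matters.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000LL
#define NSEC_PER_SEC  1000000000LL

/* Largest second count a signed 32-bit tv_sec can hold (reached in January 2038). */
static int64_t max_32bit_seconds(void)
{
    return INT32_MAX;
}

/* Whole 365-day years covered by a signed 64-bit nanosecond counter. */
static int64_t ns64_range_years(void)
{
    return INT64_MAX / NSEC_PER_SEC / (365LL * 24 * 60 * 60);
}

/* Elapsed microseconds between two 64-bit nanosecond timestamps. */
static int64_t elapsed_us(int64_t start_ns, int64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (end_ns - start_ns) / NSEC_PER_USEC;
}
```

A signed 64-bit nanosecond count covers roughly 292 years, so start/end measurements of this kind are nowhere near an overflow.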
Link: http://lkml.kernel.org/r/20150128141611.GA2701@tinar
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a trace event contains an array, there is currently no standard
way to format this for text output. Drivers are currently hacking
around this by a) local hacks that use the trace_seq functionality
directly, or b) just not printing that information. For fixed size
arrays, formatting of the elements can be open-coded, but this gets
cumbersome for arrays of non-trivial size.
These approaches result in non-standard content of the event format
description delivered to userspace, so userland tools need to be
taught to understand and parse each array printing method
individually.
This patch implements a __print_array() helper that tracepoint
implementations can use instead of reinventing it. A simple C-style
syntax is used to delimit the array and its elements {like,this}.
So that the helpers can be used with large static arrays as well as
dynamic arrays, they take a pointer and element count: they can be
used with __get_dynamic_array() for use with dynamic arrays.
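For illustration only, a user-space formatter producing the {like,this} syntax from a pointer and an element count. format_array() is a hypothetical stand-in, not the kernel helper, and printing the elements in hex is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes `count` elements of `el_size` bytes each as "{0x1,0x2,...}".
 * Returns the number of characters written (excluding the NUL), or -1 if
 * the buffer is too small or el_size is not 1, 2, 4 or 8.
 */
static int format_array(char *buf, size_t len, const void *p,
                        size_t count, size_t el_size)
{
    const unsigned char *ptr = p;
    size_t used = 0;
    int n;

    n = snprintf(buf, len, "{");
    if (n < 0 || (size_t)n >= len)
        return -1;
    used += (size_t)n;

    for (size_t i = 0; i < count; i++, ptr += el_size) {
        unsigned long long v;
        uint8_t v8;
        uint16_t v16;
        uint32_t v32;
        uint64_t v64;

        /* memcpy avoids unaligned reads from the raw byte pointer. */
        switch (el_size) {
        case 1: memcpy(&v8, ptr, 1); v = v8; break;
        case 2: memcpy(&v16, ptr, 2); v = v16; break;
        case 4: memcpy(&v32, ptr, 4); v = v32; break;
        case 8: memcpy(&v64, ptr, 8); v = v64; break;
        default: return -1;
        }

        /* A comma goes before every element except the first. */
        n = snprintf(buf + used, len - used, "%s0x%llx", i ? "," : "", v);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, len - used, "}");
    if (n < 0 || (size_t)n >= len - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}
```

Taking a pointer plus count, rather than an array type, is what lets the same routine serve static and dynamically sized arrays.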
Link: http://lkml.kernel.org/r/1422449335-8289-2-git-send-email-javi.merino@arm.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge tag 'pr-20150114-x86-entry' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/asm
Pull x86/entry enhancements from Andy Lutomirski:
" This is my accumulated x86 entry work, part 1, for 3.20. The meat
of this is an IST rework. When an IST exception interrupts user
space, we will handle it on the per-thread kernel stack instead of
on the IST stack. This sounds messy, but it actually simplifies the
IST entry/exit code, because it eliminates some ugly games we used
to play in order to handle rescheduling, signal delivery, etc on the
way out of an IST exception.
The IST rework introduces proper context tracking to IST exception
handlers. I haven't seen any bug reports, but the old code could
have incorrectly treated an IST exception handler as an RCU extended
quiescent state.
The memory failure change (included in this pull request with
Borislav and Tony's permission) eliminates a bunch of code that
is no longer needed now that user memory failure handlers are
called in process context.
Finally, this includes a few of Denys' uncontroversial and Obviously
Correct (tm) cleanups.
The IST and memory failure changes have been in -next for a while.
LKML references:
IST rework:
http://lkml.kernel.org/r/cover.1416604491.git.luto@amacapital.net
Memory failure change:
http://lkml.kernel.org/r/54ab2ffa301102cd6e@agluck-desk.sc.intel.com
Denys' cleanups:
http://lkml.kernel.org/r/1420927210-19738-1-git-send-email-dvlasenk@redhat.com
"
This tree semantically depends on and is based on the following RCU commit:
734d168013 ("rcu: Make rcu_nmi_enter() handle nesting")
... and for that reason won't be pushed upstream before the RCU bits hit Linus's tree.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The fix from 9fc81d8742 ("perf: Fix events installation during
moving group") was incomplete in that it failed to recognise that
creating a group with events for different CPUs is semantically
broken -- they cannot be co-scheduled.
Furthermore, it leads to real breakage where, when we create an event
for CPU Y and then migrate it to form a group on CPU X, the code gets
confused where the counter is programmed -- triggered in practice
as well by me via the perf fuzzer.
Fix this by tightening the rules for creating groups. Only allow
grouping of counters that can be co-scheduled in the same context.
This means for the same task and/or the same cpu.
Fixes: 9fc81d8742 ("perf: Fix events installation during moving group")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.
Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.
[ mingo: changed the solution to zero initialization: uninitialized_var()
needs to die, as it's an actively dangerous construct: if in the future
a known-proven-good piece of code is changed to have a true, buggy
uninitialized variable, the compiler warning is then suppressed...
The better long term solution is to clean up the code flow, so that
even simple minded compilers (and humans!) are able to read it without
getting a headache. ]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.c
Two simple sets of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the output-confusing newline below:
[ 0.191328]
**********************************************************
[ 0.191493] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.191586] ** **
...
Link: http://lkml.kernel.org/r/1422375440-31970-1-git-send-email-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
[ added an extra '\n' by itself, to keep what it was supposed to do ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull networking fixes from David Miller:
1) Don't OOPS on socket AIO, from Christoph Hellwig.
2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
Grumbach.
3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.
4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
Starovoitov.
5) Lots of crash, memory leak, short TX packet et al bug fixes in
sh_eth from Ben Hutchings.
6) Fix memory corruption in SCTP wrt. INIT collisions, from Daniel
Borkmann.
7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
From Eric Dumazet and Govindarajulu Varadarajan.
8) Header length calculation fix in mac80211 from Fred Chou.
9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
From Ezequiel Garcia.
10) udp_diag has bogus logic in its hash chain skipping, copy same fix
tcp diag used. From Herbert Xu.
11) amd-xgbe programs wrong rx flow control register, from Thomas
Lendacky.
12) Fix race leading to use after free in ping receive path, from Subash
Abhinov Kasiviswanathan.
13) Cache redirect routes otherwise we can get a heavy backlog of rcu
jobs liberating DST_NOCACHE entries. From Hannes Frederic Sowa.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
net: don't OOPS on socket aio
stmmac: prevent probe drivers to crash kernel
bnx2x: fix napi poll return value for repoll
ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
sh_eth: Fix DMA-API usage for RX buffers
sh_eth: Check for DMA mapping errors on transmit
sh_eth: Ensure DMA engines are stopped before freeing buffers
sh_eth: Remove RX overflow log messages
ping: Fix race in free in receive path
udp_diag: Fix socket skipping within chain
can: kvaser_usb: Fix state handling upon BUS_ERROR events
can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
can: kvaser_usb: Send correct context to URB completion
can: kvaser_usb: Do not sleep in atomic context
ipv4: try to cache dst_entries which would cause a redirect
samples: bpf: relax test_maps check
bpf: rcu lock must not be held when calling copy_to_user()
net: sctp: fix slab corruption from use after free on INIT collisions
net: mv643xx_eth: Fix highmem support in non-TSO egress path
sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
...
BUG: sleeping function called from invalid context at mm/memory.c:3732
in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
1 lock held by test_maps/671:
#0: (rcu_read_lock){......}, at: [<0000000000264190>] map_lookup_elem+0xe8/0x260
Call Trace:
([<0000000000115b7e>] show_trace+0x12e/0x150)
[<0000000000115c40>] show_stack+0xa0/0x100
[<00000000009b163c>] dump_stack+0x74/0xc8
[<000000000017424a>] ___might_sleep+0x23a/0x248
[<00000000002b58e8>] might_fault+0x70/0xe8
[<0000000000264230>] map_lookup_elem+0x188/0x260
[<0000000000264716>] SyS_bpf+0x20e/0x840
Fix it by allocating temporary buffer to store map element value.
Fixes: db20fd2b01 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup fix from Tejun Heo:
"The lifetime rules of cgroup hierarchies always have been somewhat
counter-intuitive and cgroup core tried to enforce that hierarchies
w/o userland-visible usages must die in finite amount of time so that
the controllers can be reused for other hierarchies; unfortunately,
this can't be implemented reasonably for the memory controller - the
kmemcg part doesn't have any way to forcefully drain the existing
usages, leading to an interruptible hang if a following mount attempts
to use the controller in any way.
So, it seems like we're stuck with "hierarchies live on till they die
whenever that may be" at least for now. This pretty much confines
attaching controllers to hierarchies to before the hierarchies are
actively used by making dynamic configurations post active usages
unreliable. This has never been reliable and should be fine in
practice given how cgroups are used.
After the patch, a hierarchy isn't killed if it isn't already
drained. A following mount attempt of the same mount options will
reuse the existing hierarchy. Mount attempts with differing options
will fail w/ -EBUSY"
* 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: prevent mount hang due to memory controller lifetime
Replace the old ns->bacct only with NULL and only if it still points
to acct. And assign the new value to it *before* calling acct_kill()
in acct_on(). That way we don't need to pass the new acct to acct_kill().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull x86 fixes from Thomas Gleixner:
"Hopefully the last round of fixes for 3.19
- regression fix for the LDT changes
- regression fix for XEN interrupt handling caused by the APIC
changes
- regression fixes for the PAT changes
- last minute fixes for the new MPX support
- regression fix for 32bit UP
- fix for a long standing relocation issue on 64bit tagged for stable
- functional fix for the Hyper-V clocksource tagged for stable
- downgrade of a pr_err which tends to confuse users
Looks a bit on the large side, but almost half of it are valuable
comments"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Change Fast TSC calibration failed from error to info
x86/apic: Re-enable PCI_MSI support for non-SMP X86_32
x86, mm: Change cachemode exports to non-gpl
x86, tls: Interpret an all-zero struct user_desc as "no segment"
x86, tls, ldt: Stop checking lm in LDT_empty
x86, mpx: Strictly enforce empty prctl() args
x86, mpx: Fix potential performance issue on unmaps
x86, mpx: Explicitly disable 32-bit MPX support on 64-bit kernels
x86, hyperv: Mark the Hyper-V clocksource as being continuous
x86: Don't rely on VMWare emulating PAT MSR correctly
x86, irq: Properly tag virtualization entry in /proc/interrupts
x86, boot: Skip relocs when load address unchanged
x86/xen: Override ACPI IRQ management callback __acpi_unregister_gsi
ACPI: pci: Do not clear pci_dev->irq in acpi_pci_irq_disable()
x86/xen: Treat SCI interrupt as normal GSI interrupt
Pull timer fixes from Thomas Gleixner:
"A set of small fixes:
- regression fix for exynos_mct clocksource
- trivial build fix for kona clocksource
- functional one liner fix for the sh_tmu clocksource
- two validation fixes to prevent (root only) data corruption in the
kernel via settimeofday and adjtimex. Tagged for stable"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: adjtimex: Validate the ADJ_FREQUENCY values
time: settimeofday: Validate the values of tv from user
clocksource: sh_tmu: Set cpu_possible_mask to fix SMP broadcast
clocksource: kona: fix __iomem annotation
clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write
kernel/time/hrtimer.c:444:9: sparse: symbol '__hrtimer_get_next_event' was not declared. Should it be static?
Fixes: 9bc7491906 hrtimer: Prevent stale expiry time in hrtimer_interrupt()
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: kbuild-all@01.org
Link: http://lkml.kernel.org/r/20150123121206.GA4766@snb
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge tag 'fortglx-3.20-time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull time updates from John Stultz for 3.20:
* ktime division optimization
* Expose a few more y2038-safe timekeeping interfaces
* RTC core changes to address y2038
rtc_set_ntp_time() uses timespec which is y2038-unsafe,
so modify to use timespec64 which is y2038-safe, then
replace rtc_time_to_tm() with rtc_time64_to_tm().
Also adjust all its call sites (only NTP uses it) accordingly.
Cc: pang.xunlei <pang.xunlei@linaro.org>
Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Adds a timespec64 based getboottime64() implementation
that can be used as we convert internal users of
getboottime away from using timespecs.
Cc: pang.xunlei <pang.xunlei@linaro.org>
Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
At least on ARM, do_div() is optimized to turn constant divisors into
an inline multiplication by the reciprocal value at compile time.
However this optimization is missed entirely whenever ktime_divns() is
used and the slow out-of-line division code is used all the time.
Let ktime_divns() use do_div() inline whenever the divisor is constant
and small enough. This will make things like ktime_to_us() and
ktime_to_ms() much faster.
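The idea can be sketched in user space as follows. divns() and divns_slow() are hypothetical names, plain C division stands in for do_div(), and __builtin_constant_p is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the out-of-line generic division path. */
static int64_t divns_slow(int64_t ns, int64_t div)
{
    assert(div > 0);
    return ns / div;
}

/*
 * When the divisor is a compile-time constant that fits in 32 bits, divide
 * inline so the compiler is free to use a multiplication by the reciprocal.
 * Otherwise fall back to the out-of-line helper.
 */
static inline int64_t divns(int64_t ns, int64_t div)
{
    if (__builtin_constant_p(div) && div > 0 && (div >> 32) == 0)
        return ns / (uint32_t)div;
    return divns_slow(ns, div);
}
```

Both paths return the same result; only the generated code differs, and only when the compiler can see the constant at the call site.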
Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicolas Pitre <nico@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Remove the function get_safe_write_buffer() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
PM QoS requests are notoriously hard to debug and made even
more so due to their highly dynamic nature. Having visibility
into the internal data representation per constraint allows
us to have much better appreciation of potential issues or
bad usage by drivers in the system.
So introduce, for all classes of PM QoS, an entry in
/sys/kernel/debug/pm_qos that shall show all the current
requests as well as the snapshot of the value these requests
boil down to. For example:
==> /sys/kernel/debug/pm_qos/cpu_dma_latency <==
1: 4444: Active
2: 2000000000: Default
3: 2000000000: Default
4: 2000000000: Default
Type=Minimum, Value=4444, Requests: active=1 / total=4
==> /sys/kernel/debug/pm_qos/memory_bandwidth <==
Empty!
...
The meaning of each listed value depends on the QoS class
it belongs to. The 'Type' indicates what logic is used
to collate the requests: Minimum, Maximum, or Sum.
Value is the collation of all requests.
This interface also compares the values with the defaults
for the QoS class and marks the ones that are
currently active.
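A minimal sketch of the collation described above, using the cpu_dma_latency numbers from the example. The names are hypothetical, and returning the class default for an empty request list is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum collate_type { COLLATE_MIN, COLLATE_MAX, COLLATE_SUM };

/* Collate all requests of one class into a single value. */
static int64_t collate_requests(enum collate_type type, const int32_t *req,
                                size_t n, int32_t def)
{
    int64_t value;

    if (n == 0)
        return def; /* assumption: no requests means the default applies */
    assert(req != NULL);

    value = req[0];
    for (size_t i = 1; i < n; i++) {
        switch (type) {
        case COLLATE_MIN:
            if (req[i] < value)
                value = req[i];
            break;
        case COLLATE_MAX:
            if (req[i] > value)
                value = req[i];
            break;
        case COLLATE_SUM:
            value += req[i];
            break;
        }
    }
    return value;
}

/* A request counts as active when it differs from the class default. */
static size_t count_active(const int32_t *req, size_t n, int32_t def)
{
    size_t active = 0;

    for (size_t i = 0; i < n; i++)
        if (req[i] != def)
            active++;
    return active;
}
```

For the four requests shown (4444 and three defaults of 2000000000), Type=Minimum collates to 4444 with one active request out of four.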
Signed-off-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Dave Gerlach <d-gerlach@ti.com>
Acked-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Every kernel build that includes X.509 support prints out
a message like
- Including cert signing_key.x509
This may be useful for some cases, but when doing automated
build tests, it just means noise.
To hide the message, this uses '$(kecho)' for printing the
message, which means we still see it when building with V=1,
but not at the normal level or when building with 'make -s'.
Signed-off-by: Arnd Bergmann <arnd@arnd.de>
Signed-off-by: David Howells <dhowells@redhat.com>
hrtimer_interrupt() has the following subtle issue:
hrtimer_interrupt()
lock(cpu_base);
expires_next = KTIME_MAX;
expire_timers(CLOCK_MONOTONIC);
expires = get_next_timer(CLOCK_MONOTONIC);
if (expires < expires_next)
expires_next = expires;
expire_timers(CLOCK_REALTIME);
unlock(cpu_base);
wakeup()
hrtimer_start(CLOCK_MONOTONIC, newtimer);
lock(cpu_base);
expires = get_next_timer(CLOCK_REALTIME);
if (expires < expires_next)
expires_next = expires;
So because we already evaluated the next expiring timer of
CLOCK_MONOTONIC we ignore that the expiry time of newtimer might be
earlier than the overall next expiry time in hrtimer_interrupt().
To solve this, remove the caching of the next expiry value from
hrtimer_interrupt() and reevaluate all active clock bases for the next
expiry value. To avoid another code duplication, create a shared
evaluation function and use it for hrtimer_get_next_event(),
hrtimer_force_reprogram() and hrtimer_interrupt().
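A simplified model of such a shared evaluation function. struct clock_base_model and next_event() are hypothetical, and per-base clock offsets are deliberately left out of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_KTIME_MAX INT64_MAX

/* Toy clock base: only what the scan needs. */
struct clock_base_model {
    int active;          /* non-zero if the base has a queued timer */
    int64_t next_expiry; /* earliest expiry queued on this base, in ns */
};

/* Re-scan every active base and return the earliest expiry. */
static int64_t next_event(const struct clock_base_model *bases, size_t n)
{
    int64_t expires_next = MODEL_KTIME_MAX;

    assert(bases != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!bases[i].active)
            continue;
        if (bases[i].next_expiry < expires_next)
            expires_next = bases[i].next_expiry;
    }
    return expires_next;
}
```

Because nothing is cached between calls, a timer enqueued on an already visited base is picked up by the next scan.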
There is another subtlety in this mechanism:
While hrtimer_interrupt() is running, we want to avoid touching the
hardware device because we will reprogram it anyway at the end of
hrtimer_interrupt(). This works nicely for hrtimers which get rearmed
via the HRTIMER_RESTART mechanism, because we drop out when the
callback on that CPU is running. But that fails, if a new timer gets
enqueued like in the example above.
This has another implication: While hrtimer_interrupt() is running we
refuse remote enqueueing of timers - see hrtimer_interrupt() and
hrtimer_check_target().
hrtimer_interrupt() tries to prevent this by setting cpu_base->expires
to KTIME_MAX, but that fails if a new timer gets queued.
Prevent both the hardware access and the remote enqueue
explicitly. We can loosen the restriction on the remote enqueue now
due to reevaluation of the next expiry value, but that needs a
separate patch.
Folded in a fix from Vignesh Radhakrishnan.
Reported-and-tested-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Based-on-patch-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: vigneshr@codeaurora.org
Cc: john.stultz@linaro.org
Cc: viresh.kumar@linaro.org
Cc: fweisbec@gmail.com
Cc: cl@linux.com
Cc: stuart.w.hayes@gmail.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1501202049190.5526@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Problem:
The default behavior of the kernel is somewhat undesirable as all
requested interrupts end up on CPU0 after registration. A user can
run the irqbalance daemon, or can manually configure smp_affinity via
the proc filesystem, but the default affinity of the interrupts for all
devices is always CPU zero. This can cause performance problems or
very heavy CPU use on a single core if not noticed and fixed by the
user.
Solution:
Enable the setting of the initial affinity directly when the driver
sets a hint.
This enabling means that kernel drivers can include an initial
affinity setting for the interrupt, instead of all interrupts starting
out life on CPU0. Of course if irqbalance is still running then the
interrupts will get moved as before.
This function is currently called by drivers in block, crypto,
infiniband, ethernet and scsi trees, but only a handful, so these will
be the devices affected by this change.
Tested on i40e, and default interrupts were spread across the CPUs
according to the hint.
drivers/block/mtip32xx/mtip32xx.c:3
drivers/block/nvme-core.c:2
drivers/crypto/qat/qat_dh895xcc/adf_isr.c:3
drivers/infiniband/hw/qib/qib_iba7322.c:2
drivers/net/ethernet/intel/i40e/i40e_main.c:3
drivers/net/ethernet/intel/i40evf/i40evf_main.c:3
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3
drivers/net/ethernet/mellanox/mlx4/en_cq.c:2
drivers/scsi/hpsa.c:3
drivers/scsi/lpfc/lpfc_init.c:3
drivers/scsi/megaraid/megaraid_sas_base.c:8
drivers/soc/ti/knav_qmss_acc.c:1
drivers/soc/ti/knav_qmss_queue.c:2
drivers/virtio/virtio_pci_common.c:2
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20141219012206.4220.27491.stgit@jbrandeb-cp2.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The following race exists in the smpboot percpu threads management:
CPU0 CPU1
cpu_up(2)
get_online_cpus();
smpboot_create_threads(2);
smpboot_register_percpu_thread();
for_each_online_cpu();
__smpboot_create_thread();
__cpu_up(2);
This results in a missing per cpu thread for the newly onlined cpu2 and
in a NULL pointer dereference on a consecutive offline of that cpu.
Protect smpboot_register_percpu_thread() with get_online_cpus() to
prevent that.
[ tglx: Massaged changelog and removed the change in
smpboot_unregister_percpu_thread() because that's an
optimization and therefore not stable material. ]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1406777421-12830-1-git-send-email-laijs@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to ensure that filenames are not released before the audit
subsystem is done with the strings there are a number of hacks built
into the fs and audit subsystems around getname() and putname(). To
say these hacks are "ugly" would be kind.
This patch removes the filename hackery in favor of a more
conventional reference count based approach. The diffstat below tells
most of the story; lots of audit/fs specific code is replaced with a
traditional reference count based approach that is easily understood,
even by those not familiar with the audit and/or fs subsystems.
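As a generic illustration of the reference counting pattern (not the actual fs/audit code): struct name_model and its helpers are hypothetical, and a plain int is used where real code would need atomic or otherwise serialized updates.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy refcounted name; single-threaded sketch only. */
struct name_model {
    int refcnt;
    char *name;
};

/* Allocate a name with one reference held by the caller. */
static struct name_model *name_new(const char *s)
{
    size_t len = strlen(s) + 1;
    struct name_model *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    n->name = malloc(len);
    if (!n->name) {
        free(n);
        return NULL;
    }
    memcpy(n->name, s, len);
    n->refcnt = 1;
    return n;
}

/* Take an extra reference, e.g. for a second subsystem holding the name. */
static struct name_model *name_ref(struct name_model *n)
{
    assert(n->refcnt > 0);
    n->refcnt++;
    return n;
}

/* Drop a reference; returns 1 if this was the last one and the name was freed. */
static int name_put(struct name_model *n)
{
    assert(n->refcnt > 0);
    if (--n->refcnt > 0)
        return 0;
    free(n->name);
    free(n);
    return 1;
}
```

Each holder pairs its own get with its own put, so the string stays valid until the last user is done with it.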
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In all likelihood there were some subtle, and perhaps not so subtle,
bugs with filename matching in audit_inode() and audit_inode_child()
for some time, however, recent changes to the audit filename code have
definitely broken the filename matching code. The breakage could
result in duplicate filenames in the audit log and other odd audit
record entries. This patch fixes the filename matching code and
restores some sanity to the filename audit records.
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Enable recording of filenames in getname_kernel() and remove the
kludgy workaround in __audit_inode() now that we have proper filename
logging for kernel users.
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Description from Michael Kerrisk. He suggested an identical patch
to one I had already coded up and tested.
commit fe3d197f84 "x86, mpx: On-demand kernel allocation of bounds
tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
that unused arguments are zero, as is done in many existing prctl()s
and as should be done for all new prctl()s. This patch adds the
required checks.
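The shape of such a check, sketched in user space. model_prctl() and the MODEL_* option numbers are hypothetical and are not the real prctl() constants.

```c
#include <errno.h>

/* Hypothetical option numbers for this sketch only. */
enum {
    MODEL_ENABLE_MANAGEMENT = 1,
    MODEL_DISABLE_MANAGEMENT = 2
};

/* Reject the call unless every unused argument is zero. */
static long model_prctl(int option, unsigned long arg2, unsigned long arg3,
                        unsigned long arg4, unsigned long arg5)
{
    switch (option) {
    case MODEL_ENABLE_MANAGEMENT:
    case MODEL_DISABLE_MANAGEMENT:
        if (arg2 || arg3 || arg4 || arg5)
            return -EINVAL;
        return 0; /* the real operation would run here */
    default:
        return -EINVAL;
    }
}
```

Rejecting non-zero values up front leaves room to give those arguments a meaning later without breaking callers that passed garbage.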
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
First two are minor fallout from the param rework which went in this merge
window.
Next three are a series which fixes a longstanding (but never previously
reported and unlikely, so no CC stable) race between kallsyms and freeing
the init section.
Finally, a minor cleanup as our module refcount will now be -1 during
unload.
Thanks,
Rusty.
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module and param fixes from Rusty Russell:
"Surprising number of fixes this merge window :(
The first two are minor fallout from the param rework which went in
this merge window.
The next three are a series which fixes a longstanding (but never
previously reported and unlikely, so no CC stable) race between
kallsyms and freeing the init section.
Finally, a minor cleanup as our module refcount will now be -1 during
unload"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: make module_refcount() a signed integer.
module: fix race in kallsyms resolution during module load success.
module: remove mod arg from module_free, rename module_memfree().
module_arch_freeing_init(): new hook for archs before module->module_init freed.
param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC
param: initialize store function to NULL if not available.
tracing_init_dentry() will soon return NULL as a valid pointer for the
top level tracing directory. NULL cannot be used as an error value.
Instead, switch to ERR_PTR() and check the return status with
IS_ERR().
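For readers unfamiliar with the idiom, here is a user-space re-creation of the error-pointer pattern. The model_* helpers are stand-ins written for this sketch, and it assumes the top 4095 addresses are never valid pointers.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *model_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long model_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for pointers in the reserved error range; NULL is not an error. */
static inline int model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```

Since NULL falls outside the error range, a function can return NULL as a valid result and still report failures through the same return value.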
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The creation of tracing files and directories is for the most part
encapsulated in helper functions in trace.c. Other files do not need to
include debugfs.h or fs.h, as they may have needed to in the past.
Remove them from the files that do not need them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since b2052564e6 ("mm: memcontrol: continue cache reclaim from
offlined groups"), re-mounting the memory controller after using it is
very likely to hang.
The cgroup core assumes that any remaining references after deleting a
cgroup are temporary in nature, and synchronously waits for them, but
the above-mentioned commit has left-over page cache pin its css until
it is reclaimed naturally. That being said, swap entries and charged
kernel memory have been doing the same indefinite pinning forever, the
bug is just more likely to trigger with left-over page cache.
Reparenting kernel memory is highly impractical, which leaves changing
the cgroup assumptions to reflect this: once a controller has been
mounted and used, it has internal state that is independent from mount
and cgroup lifetime. It can be unmounted and remounted, but it can't
be reconfigured during subsequent mounts.
Don't offline the controller root as long as there are any children,
dead or alive. A remount will no longer wait for these old references
to drain, it will simply mount the persistent controller state again.
Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
James Bottomley points out that it will be -1 during unload. It's
only used for diagnostics, so let's not hide that as it could be a
clue as to what's gone wrong.
Cc: Jason Wessel <jason.wessel@windriver.com>
Acked-and-documention-added-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Reviewed-by: Masami Hiramatsu <maasami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fix a potentially uninitialized return value in klp_enable_func().
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- Preemptible-RCU fixes, including fixing an old bug in the
interaction of RCU priority boosting and CPU hotplug.
- SRCU updates.
- RCU CPU stall-warning updates.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull workqueue fix from Tejun Heo:
"The xfs folks have been running into weird and very rare lockups for
some time now. I didn't think this could have been from workqueue
side because no one else was reporting it. This time, Eric had a
kdump which we looked into and it turned out this actually was a
workqueue bug and the bug has been there since the beginning of
concurrency managed workqueue.
A worker pool ensures forward progress of the workqueues associated
with it by always having at least one worker reserved from executing
work items. When the pool is under contention, the idle one tries to
create more workers for the pool and if that doesn't succeed quickly
enough, it calls the rescuers to the pool.
This logic had a subtle race condition in an early exit path. When a
worker invokes this manager function, the function may return %false
indicating that the caller may proceed to executing work items either
because another worker is already performing the role or conditions
have changed and the pool is no longer under contention.
The latter part depended on the assumption that whether more workers
are necessary or not remains stable while the pool is locked; however,
pool->nr_running (concurrency count) may change asynchronously and it
getting bumped from zero asynchronously could send off the last idle
worker to execute work items.
The race window is fairly narrow, and, even when it gets triggered,
the pool deadlocks iff all work items get blocked on pending work
items of the pool, which is highly unlikely but can be triggered by
xfs.
The patch removes the race window by removing the early exit path,
which doesn't serve any purpose anymore anyway"
* 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix subtle pool management issue which can stall whole worker_pool
Add support for patching a function multiple times. If multiple patches
affect a function, the function in the most recently enabled patch
"wins". This enables a cumulative patch upgrade path, where each patch
is a superset of previous patches.
This requires restructuring the data a little bit. With the current
design, where each klp_func struct has its own ftrace_ops, we'd have to
unregister the old ops and then register the new ops, because
FTRACE_OPS_FL_IPMODIFY prevents us from having two ops registered for
the same function at the same time. That would leave a regression
window where the function isn't patched at all (not good for a patch
upgrade path).
This patch replaces the per-klp_func ftrace_ops with a global klp_ops
list, with one ftrace_ops per original function. A single ftrace_ops is
shared between all klp_funcs which have the same old_addr. This allows
the switch between function versions to happen instantaneously by
updating the klp_ops struct's func_stack list. The winner is the
klp_func at the top of the func_stack (front of the list).
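A minimal sketch of the func_stack idea, assuming a plain singly linked
list. The struct and function names below are hypothetical and much
simpler than the real klp_func and klp_ops; the point is only that the
front of the list decides which function version is live.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for klp_func and klp_ops. */
struct func_sketch {
	void *new_func;			/* replacement function */
	struct func_sketch *next;	/* next-older patch of this function */
};

struct ops_sketch {
	unsigned long old_addr;		/* address of the original function */
	struct func_sketch *func_stack;	/* front = most recently enabled */
};

/* Enabling a patch pushes its func onto the front of the stack. */
static void ops_push(struct ops_sketch *ops, struct func_sketch *func)
{
	assert(ops && func);
	func->next = ops->func_stack;
	ops->func_stack = func;
}

/* Disabling the topmost patch pops it, re-exposing the previous one. */
static struct func_sketch *ops_pop(struct ops_sketch *ops)
{
	struct func_sketch *top = ops->func_stack;

	if (top)
		ops->func_stack = top->next;
	return top;
}

/* What a shared handler would redirect to; NULL means "not patched". */
static void *ops_current(const struct ops_sketch *ops)
{
	return ops->func_stack ? ops->func_stack->new_func : NULL;
}
```

Because the switch is a single list update, there is no window in which
the function is left unpatched between the old and the new version.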
[ jkosina@suse.cz: turn WARN_ON() into WARN_ON_ONCE() in ftrace handler to
avoid storm in pathological cases ]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Only allow the topmost patch on the stack to be enabled or disabled, so
that patches can't be removed or added in an arbitrary order.
Suggested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Should have been removed with commit 18900909 ("audit: remove the old
depricated kernel interface").
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING in Kconfigs. HAVE_
bools are prevalent there and we should go with the flow.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The kallsyms routines (module_symbol_name, lookup_module_* etc) disable
preemption to walk the modules rather than taking the module_mutex:
this is because they are used for symbol resolution during oopses.
This works because there are synchronize_sched() and synchronize_rcu()
in the unload and failure paths. However, there's one case which doesn't
have that: the normal case where module loading succeeds, and we free
the init section.
We don't want a synchronize_rcu() there, because it would slow down
module loading: this bug was introduced in 2009 to speed module
loading in the first place.
Thus, we want to do the free in an RCU callback. We do this in the
simplest possible way by allocating a new rcu_head: if we put it in
the module structure we'd have to worry about that getting freed.
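The shape of that approach can be modelled in userspace. This is a toy
simulation, not kernel code: head_sketch stands in for rcu_head,
defer_call() for call_rcu(), and grace_period_elapsed() for the point at
which no preempt-disabled reader can still be walking the module. All
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for rcu_head: a callback queued until a grace period ends. */
struct head_sketch {
	struct head_sketch *next;
	void (*func)(struct head_sketch *head);
};

/* Separately allocated record, so it can outlive the module structure. */
struct initfree_sketch {
	struct head_sketch rcu;
	void *module_init;	/* init-section memory to release */
};

static struct head_sketch *pending;	/* callbacks awaiting a grace period */
static int init_sections_freed;

static void defer_call(struct head_sketch *head,
		       void (*func)(struct head_sketch *head))
{
	assert(head && func);
	head->func = func;
	head->next = pending;
	pending = head;
}

static void do_free_init(struct head_sketch *head)
{
	/* Recover the enclosing record from its embedded head. */
	struct initfree_sketch *m = (struct initfree_sketch *)
		((char *)head - offsetof(struct initfree_sketch, rcu));

	free(m->module_init);
	free(m);
	init_sections_freed++;
}

/* On successful load: queue the free instead of doing it inline. */
static int queue_init_free(void *module_init)
{
	struct initfree_sketch *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->module_init = module_init;
	defer_call(&m->rcu, do_free_init);
	return 0;
}

/* Simulates the end of a grace period by running the queued callbacks. */
static void grace_period_elapsed(void)
{
	while (pending) {
		struct head_sketch *head = pending;

		pending = head->next;	/* unlink before the callback frees it */
		head->func(head);
	}
}
```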
Reported-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Nothing needs the module pointer any more, and the next patch will
call it from RCU, where the module itself might no longer exist.
Removing the arg is the safest approach.
This just codifies the use of the module_alloc/module_free pattern
which ftrace and bpf use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: x86@kernel.org
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: nios2-dev@lists.rocketboards.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: netdev@vger.kernel.org
Archs have been abusing module_free() to clean up their arch-specific
allocations. Since module_free() is also (ab)used by BPF and trace code,
let's keep it to simple allocations, and provide a hook called before
that.
This means that avr32, ia64, parisc and s390 no longer need to implement
their own module_free() at all. avr32 doesn't need module_finalize()
either.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
ignore_lockdep is uninitialized, and sysfs_attr_init() doesn't initialize
it, so memset to 0.
Reported-by: Huang Ying <ying.huang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch fixes two separate buglets in calls to futex_lock_pi():
* Eliminate unused 'detect' argument
* Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL
The 'detect' argument of futex_lock_pi() seems never to have been
used: when it was included with the initial PI mutex implementation
in Linux 2.6.18, all checks against its value were disabled by
ANDing against 0 (i.e., if (detect... && 0)), and with
commit 778e9a9c3e, any mention of
this argument in futex_lock_pi() went away altogether. Its presence
now serves only to confuse readers of the code, by giving the
impression that the futex() FUTEX_LOCK_PI operation actually does
use the 'val' argument. This patch removes the argument.
The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes
'timeout' as one of its arguments. This misleads the reader into thinking
that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible
purpose; but it does not. Indeed, it cannot, because the checks at the
start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations
that do copy_from_user() on the timeout argument. So, in the
FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change
'timeout' to 'NULL'. This patch does that.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Darren Hart <darren@dvhart.com>
Link: http://lkml.kernel.org/r/54B96646.8010200@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
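The pitfall can be shown with a small model. The msg_sketch type and the
helper names are hypothetical and do not reflect the real netlink or
sk_buff API; HDR_LEN is an assumed header size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature message buffer; not the real sk_buff API. */
struct msg_sketch {
	size_t len;
};

#define HDR_LEN 16	/* assumed header size, always present */

/* Old style: returns the message length, which is always positive. */
static int msg_end_old(struct msg_sketch *m)
{
	assert(m->len >= HDR_LEN);
	return (int)m->len;
}

/* New style: nothing is returned, so nothing can be misinterpreted. */
static void msg_end_new(struct msg_sketch *m)
{
	assert(m->len >= HDR_LEN);
}

static int fill_old(struct msg_sketch *m)
{
	m->len = HDR_LEN + 4;
	return msg_end_old(m);	/* "success" is a positive number */
}

static int fill_new(struct msg_sketch *m)
{
	m->len = HDR_LEN + 4;
	msg_end_new(m);
	return 0;		/* conventional 0 on success */
}
```

A caller written as "if (fill_old(&m)) goto error;" takes the error path
on every successful call, whereas the same test against fill_new()
behaves as intended.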
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid overflow possibility.
[ The overflow is purely theoretical, since this is used for memory
ranges that aren't even close to using the full 64 bits, but this is
the right thing to do regardless. - Linus ]
Signed-off-by: Louis Langholtz <lou_langholtz@me.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Anvin <hpa@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.
This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value. This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.
Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function. need_to_create_worker() tests the following
conditions.
pending work items && !nr_running && !nr_idle
The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume and while it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable in between, making it non-zero.
If this happens, manage_workers() could return false even with zero
nr_idle making the worker, the last idle one, proceed to execute work
items. If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.
This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.
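The difference in return-value semantics can be sketched with a reduced
model. pool_sketch and both functions are hypothetical simplifications:
locking, worker creation and rescuers are omitted, and only the decision
that the caller sees is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, much-reduced model of a worker_pool. */
struct pool_sketch {
	int nr_pending;		/* protected by pool->lock */
	int nr_idle;		/* protected by pool->lock */
	int nr_running;		/* may change asynchronously */
	bool has_manager;
};

static bool need_to_create_worker(const struct pool_sketch *p)
{
	assert(p->nr_idle >= 0 && p->nr_running >= 0);
	return p->nr_pending && !p->nr_running && !p->nr_idle;
}

/* Before: false can also mean "nr_running happened to be non-zero". */
static bool manage_workers_old(struct pool_sketch *p)
{
	if (p->has_manager)
		return false;
	if (!need_to_create_worker(p))
		return false;	/* racy early exit */
	/* ... create workers, summon rescuers ... */
	return true;
}

/* After: false only when another worker already holds the manager role. */
static bool manage_workers_new(struct pool_sketch *p)
{
	if (p->has_manager)
		return false;
	p->has_manager = true;
	/* ... create workers as long as they are needed ... */
	p->has_manager = false;
	return true;
}
```

With nr_pending = 1, nr_idle = 0 and nr_running bumped to 1 by another
worker, the old variant returns false and sends the last idle worker
off to execute work items; the new variant does not.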
We could leave the early exit condition alone and just ignore the return
value, but the only reason it was put there is that manage_workers()
used to perform both creation and destruction of workers, and thus
could be invoked while the pool was trying to reduce the number of
workers. Now that manage_workers() is called only when more workers
are needed, the early exit condition triggers only in rare race
conditions, which renders it pointless.
Tested with a simulated workload and modified workqueue code which
triggers the pool deadlock reliably without this patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
Merge tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fixes from Steven Rostedt:
"This holds a few fixes to the ftrace infrastructure as well as the
mixture of function graph tracing and kprobes.
When jprobes and function graph tracing are enabled at the same time it
will crash the system:
# modprobe jprobe_example
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
After the first fork (jprobe_example probes it), the system will
crash.
This is due to the way jprobes copies the stack frame and does not do
a normal function return. This messes up the function graph
tracing accounting which hijacks the return address from the stack and
replaces it with a hook function. It saves the return addresses in a
separate stack to put back the correct return address when done. But
because the jprobe functions do not do a normal return, their stack
addresses are not put back until the function they probe is called,
which means that the probed function will get the return address of
the jprobe handler instead of its own.
The simple fix here was to disable function graph tracing while the
jprobe handler is being called.
While debugging this I found two minor bugs with the function graph
tracing.
The first was about the function graph tracer sharing its function
hash with the function tracer (they both get filtered by the same
input). Changing set_ftrace_filter would not sync the function
records after a change if the function tracer was
disabled but the function graph tracer was enabled. This was due to
the update only checking one of the ops instead of the shared ops to
see if they were enabled and should perform the sync. This caused the
ftrace accounting to break and a ftrace_bug() would be triggered,
disabling ftrace until a reboot.
The second was that the check to update records only checked one of
the filter hashes. It needs to test both the "filter" and "notrace"
hashes. The "filter" hash determines what functions to trace, whereas
the "notrace" hash determines what functions not to trace (trace all
but these). Both hashes need to be passed to the update code to find
out what change is being done during the update. This also broke the
ftrace record accounting and triggered a ftrace_bug().
This patch set also includes two more fixes that were reported
separately from the kprobe issue.
One was that init_ftrace_syscalls() was called twice at boot up. This
is not a major bug, but that call performed a rather large kmalloc
(NR_syscalls * sizeof(*syscalls_metadata)). The second call made the
first one a memory leak, and wastes memory.
The other fix is a regression caused by an update in the v3.19 merge
window. Enabling events earlier moved the enabling to before
PID 1 was created. The syscall events require setting the
TIF_SYSCALL_TRACEPOINT for all tasks. But for_each_process_thread()
does not include the swapper task (PID 0), and ended up being a nop.
A suggested fix was to add the init_task() to have its flag set, but I
didn't really want to mess with PID 0 for this minor bug. Instead I
disable and re-enable events again at early_initcall() where it used to
be enabled. This also handles any other event that might have its own
reg function that could break at early boot up"
* tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix enabling of syscall events on the command line
tracing: Remove extra call to init_ftrace_syscalls()
ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
ftrace: Check both notrace and filter for old hash
ftrace: Fix updating of filters for shared global_ops filters
The current tiny RCU stall-warning code assumes that the jiffies counter
starts at zero. However, it is sometimes initialized to other values,
for example, -30,000. This commit therefore changes rcu_init() to
invoke reset_cpu_stall_ticks() for both flavors of RCU to initialize
the stall-warning times properly at boot.
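The effect of a non-zero starting counter can be sketched as below. The
comparison macro follows the style of the kernel's time_after(); the
variable names, the STALL_TIMEOUT value and the body of
reset_stall_deadline() are assumptions made for illustration, not the
actual tiny RCU code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe "a is after b", in the style of time_after(). */
#define time_after_sketch(a, b)	((long)((b) - (a)) < 0)

#define STALL_TIMEOUT 21000UL	/* assumed stall timeout, in ticks */

static unsigned long jiffies_sketch = (unsigned long)-30000L;	/* not zero */
static unsigned long jiffies_stall;	/* zero until explicitly armed */

/* Assumed role of reset_cpu_stall_ticks(): re-arm relative to "now". */
static void reset_stall_deadline(void)
{
	/* The signed comparison only works within half the counter range. */
	assert(STALL_TIMEOUT <= (unsigned long)LONG_MAX);
	jiffies_stall = jiffies_sketch + STALL_TIMEOUT;
}

static bool stall_deadline_passed(void)
{
	return time_after_sketch(jiffies_sketch, jiffies_stall);
}
```

Without the reset at boot, the deadline stays at zero and is measured
from the wrong origin: in this model the check first fires 30,000 ticks
after boot rather than STALL_TIMEOUT ticks after boot.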
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The tiny RCU CPU stall detection depends on *rcp->curtail not being
NULL. It is however a tail pointer and thus NULL by definition. Instead we
should check rcp->rcucblist for the presence of pending callbacks which
need to be processed. With this fix INFO about the stall is printed and
jiffies_stall (jiffies at next stall) correctly updated.
Note that the check for pending callbacks is necessary to avoid spurious
warnings if there are no pending callbacks.
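Why the tail pointer cannot be used for this test can be seen in a
reduced model of the callback list. The struct and function names here
are hypothetical; only the rcucblist and curtail field names come from
the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cb_sketch {
	struct cb_sketch *next;
};

/* Reduced model of the control block's callback list. */
struct ctrlblk_sketch {
	struct cb_sketch *rcucblist;	/* head of the callback list */
	struct cb_sketch **curtail;	/* points at the final ->next slot */
};

static void ctrlblk_init(struct ctrlblk_sketch *rcp)
{
	rcp->rcucblist = NULL;
	rcp->curtail = &rcp->rcucblist;
}

static void enqueue_cb(struct ctrlblk_sketch *rcp, struct cb_sketch *cb)
{
	cb->next = NULL;
	*rcp->curtail = cb;
	rcp->curtail = &cb->next;
	assert(*rcp->curtail == NULL);	/* the tail slot is always empty */
}

/* Broken test: dereferencing the tail pointer never finds a callback. */
static bool has_pending_via_tail(const struct ctrlblk_sketch *rcp)
{
	return *rcp->curtail != NULL;
}

/* Working test: a non-empty list has a non-NULL head. */
static bool has_pending_via_head(const struct ctrlblk_sketch *rcp)
{
	return rcp->rcucblist != NULL;
}
```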
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
[ paulmck: Fused identical "if" statements, ported to -rcu. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds a message that is printed if the relevant grace-period
kthread has not been able to run for the two seconds preceding the
stall warning. (The two seconds is double the maximum interval between
successive bouts of quiescent-state forcing.)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used
in places where it would be useful for it to apply to the normal RCU
flavors, rcu_preempt, rcu_sched, and rcu_bh. This is especially the
case for workloads that aggressively overload the system, particularly
those that generate large numbers of RCU updates on systems running
NO_HZ_FULL CPUs. This commit therefore communicates quiescent states
from cond_resched_rcu_qs() to the normal RCU flavors.
Note that it is unfortunately necessary to leave the old ->passed_quiesce
mechanism in place to allow quiescent states that apply to only one
flavor to be recorded. (Yes, we could decrement ->rcu_qs_ctr_snap in
that case, but that is not so good for debugging of RCU internals.)
In addition, if one of the RCU flavors' grace periods has stalled, this
will invoke rcu_momentary_dyntick_idle(), resulting in a heavy-weight
quiescent state visible from other CPUs.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu()
was used in preemptible code. ]
Recent testing has shown that under heavy load, running RCU's grace-period
kthreads at real-time priority can improve performance (according to 0day
test robot) and reduce the incidence of RCU CPU stall warnings. However,
most systems do just fine with the default non-realtime priorities for
these kthreads, and it does not make sense to expose the entire user
base to any risk stemming from this change, given that this change is
of use only to a few users running extremely heavy workloads.
Therefore, this commit allows users to specify realtime priorities
for the grace-period kthreads, but leaves them running SCHED_OTHER
by default. The realtime priority may be specified at build time
via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
rcutree.kthread_prio parameter. Either way, 0 says to continue the
default SCHED_OTHER behavior, and values from 1 to 99 select SCHED_FIFO
at that priority. Note that a value of 0 is not permitted when
the RCU_BOOST Kconfig parameter is specified.
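The mapping from the parameter value to a scheduling policy could look
roughly like this. gp_kthread_policy() is a hypothetical helper written
for illustration, and returning -1 for a rejected value is an
assumption, not the behavior of the actual patch.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/*
 * Hypothetical helper: map a kthread_prio value to a scheduling policy.
 * Returns SCHED_OTHER, SCHED_FIFO, or -1 for a value that is not allowed.
 */
static int gp_kthread_policy(int kthread_prio, bool rcu_boost)
{
	/* -1 is only usable as a sentinel if the policies are non-negative. */
	assert(SCHED_OTHER >= 0 && SCHED_FIFO >= 0);

	if (kthread_prio < 0 || kthread_prio > 99)
		return -1;
	if (kthread_prio == 0)
		return rcu_boost ? -1 : SCHED_OTHER;	/* 0 needs !RCU_BOOST */
	return SCHED_FIFO;	/* runs at priority kthread_prio */
}
```

At boot time the value would be supplied on the kernel command line as,
for example, rcutree.kthread_prio=50.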
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 5f893b2639 "tracing: Move enabling tracepoints to just after
rcu_init()" broke the enabling of system call events from the command
line. The reason was that the enabling of command line trace events
was moved before PID 1 started, and the syscall tracepoints require
that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the
swapper task (pid 0) is not part of that. Since the swapper task is the
only task that is running this early in boot, no task gets the
flag set, and the tracepoint never gets reached.
Instead of setting the swapper task flag (there should be no reason to
do that), re-enable trace events again after the init thread (PID 1)
has been started. It requires disabling all command line events and
re-enabling them, as just enabling them again will not reset the logic
to set the TIF_SYSCALL_TRACEPOINT flag, as the syscall tracepoint will
be fooled into thinking that it was already set, and won't try setting
it again. For this reason, we must first disable it and re-enable it.
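A toy model of why a plain second enable is not enough. The names and
the single "enabled" flag are hypothetical simplifications of the event
and tracepoint registration logic; the task array stands for the tasks
that for_each_process_thread() would visit.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 8

/* Tasks visible to the iteration, and their per-task tracepoint flag. */
static bool task_flag[MAX_TASKS];
static int nr_tasks;		/* 0 before PID 1 exists; swapper is not listed */
static bool event_enabled;

static void event_enable(void)
{
	int i;

	if (event_enabled)
		return;		/* already on: the flags are not revisited */
	event_enabled = true;
	for (i = 0; i < nr_tasks; i++)
		task_flag[i] = true;
}

static void event_disable(void)
{
	int i;

	event_enabled = false;
	for (i = 0; i < nr_tasks; i++)
		task_flag[i] = false;
}

/* A new task starts with the flag clear, as its parent had it clear. */
static void spawn_task(void)
{
	assert(nr_tasks < MAX_TASKS);
	task_flag[nr_tasks++] = false;
}
```

Enabling before any task exists sets no flags, and a second enable is a
no-op, so only the disable/enable pair reaches the tasks created since.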
Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.org
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_init() calls init_ftrace_syscalls() and then calls trace_event_init()
which also calls init_ftrace_syscalls(). It makes more sense to only
call it from trace_event_init().
Calling it twice wastes memory, as it allocates the syscall events twice,
and loses the first copy of it.
Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com
Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.org
Reported-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using just the filter for checking for trampolines or regs is not enough
when updating the code against the records that represent all functions.
Both the filter hash and the notrace hash need to be checked.
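How the two hashes combine can be sketched with arrays standing in for
the hashes. The types and names are hypothetical; the rule modelled is
that an empty filter hash means "trace everything" and the notrace hash
then removes functions from that set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy hash: a small array of function addresses; count 0 means "empty". */
struct hash_sketch {
	const unsigned long *ips;
	size_t count;
};

static bool hash_contains(const struct hash_sketch *h, unsigned long ip)
{
	size_t i;

	for (i = 0; i < h->count; i++)
		if (h->ips[i] == ip)
			return true;
	return false;
}

/* An empty filter hash selects everything; notrace then subtracts. */
static bool ops_traces_ip(const struct hash_sketch *filter,
			  const struct hash_sketch *notrace,
			  unsigned long ip)
{
	assert(filter && notrace);
	if (filter->count && !hash_contains(filter, ip))
		return false;
	return !hash_contains(notrace, ip);
}
```

Looking at the filter hash alone gives the wrong answer for any function
that is listed in the notrace hash, which is how the record accounting
goes out of step.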
To trigger this bug (using trace-cmd and perf):
# perf probe -a do_fork
# trace-cmd start -B foo -e probe
# trace-cmd record -p function_graph -n do_fork sleep 1
The trace-cmd record at the end clears the filter before it disables
function_graph tracing and then that causes the accounting of the
ftrace function records to become incorrect and causes ftrace to bug.
Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org
Cc: stable@vger.kernel.org
[ still need to switch old_hash_ops to old_ops_hash ]
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the set_ftrace_filter affects both the function tracer as well as the
function graph tracer, the ops that represent each have a shared
ftrace_ops_hash structure. This allows both to be updated when the filter
files are updated.
But if function graph is enabled and the global_ops (function tracing) ops
is not, then it is possible that the filter could be changed without the
update happening for the function graph ops. This will cause the changes
to not take place and may even cause a ftrace_bug to occur as it could mess
with the trampoline accounting.
The solution is to check if the ops uses the shared global_ops filter and
if the ops itself is not enabled, to check if there's another ops that is
enabled and also shares the global_ops filter. In that case, the
modification still needs to be executed.
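The check described above can be sketched as follows. The struct layout
and update_needed() are hypothetical; sharing is modelled simply as two
ops pointing at the same hash object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder for a filter/notrace hash pair; contents are irrelevant here. */
struct hash_pair_sketch {
	int placeholder;
};

struct shared_ops_sketch {
	bool enabled;
	struct hash_pair_sketch *func_hash;	/* may be shared between ops */
};

/*
 * A filter change made through 'ops' must be applied if ops itself is
 * enabled, or if any other enabled ops shares the same hash.
 */
static bool update_needed(const struct shared_ops_sketch *ops,
			  const struct shared_ops_sketch *all, size_t n)
{
	size_t i;

	assert(ops && all);
	if (ops->enabled)
		return true;
	for (i = 0; i < n; i++)
		if (&all[i] != ops && all[i].enabled &&
		    all[i].func_hash == ops->func_hash)
			return true;
	return false;
}
```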
Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org
Cc: stable@vger.kernel.org # 3.17+
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify run_ksoftirqd() by using the new cond_resched_rcu_qs() function
that conditionally reschedules, but unconditionally supplies an RCU
quiescent state. This commit is separate from the previous commit by
Calvin Owens because Calvin's approach can be backported, while this
commit cannot be. The reason that this commit cannot be backported is
that cond_resched_rcu_qs() does not always provide the needed quiescent
state in earlier kernels.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
While debugging an issue with excessive softirq usage, I encountered the
following note in commit 3e339b5dae ("softirq: Use hotplug thread
infrastructure"):
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]
...but despite this note, the patch still calls RCU with IRQs disabled.
This seemingly innocuous change caused a significant regression in softirq
CPU usage on the sending side of a large TCP transfer (~1 GB/s): when
introducing 0.01% packet loss, the softirq usage would jump to around 25%,
spiking as high as 50%. Before the change, the usage would never exceed 5%.
Moving the call to rcu_note_context_switch() after the cond_resched() call,
as it was originally before the hotplug patch, completely eliminated this
problem.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
While debugging some "sleeping function called from invalid context" bug I
realized that the debugging message "Preemption disabled at:" pointed to
an incorrect function.
In particular if the last function/action that disabled preemption was
spin_lock_bh() then current->preempt_disable_ip won't be updated.
The reason for this is that __local_bh_disable_ip() will increase
preempt_count manually instead of calling preempt_count_add(), which
would handle the update correctly.
It looks like the manual handling was done to work around some lockdep issue.
So add the missing update of current->preempt_disable_ip to
__local_bh_disable_ip() as well.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150107090441.GC4365@osiris
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both mutexes and rwsems took a performance hit when we switched
over from the original mcs code to the cancelable variant (osq).
The reason is the use of smp_load_acquire() when polling for
node->locked. This is not needed, as reordering is not an issue,
so relax the barrier semantics. Paul describes the scenario
nicely: https://lkml.org/lkml/2013/11/19/405
- If we start polling before the insertion is complete, all that
happens is that the first few polls have no chance of seeing a lock
grant.
- Ordering the polling against the initialization -- the above
xchg() is already doing that for us.
The smp_load_acquire() when unqueuing makes sense. In addition,
we don't need to worry about leaking the critical region as
osq is only used internally.
This impacts both regular and large levels of concurrency,
e.g. on a 40-core system with a disk-intensive workload:
disk-1 804.83 ( 0.00%) 828.16 ( 2.90%)
disk-61 8063.45 ( 0.00%) 18181.82 (125.48%)
disk-121 7187.41 ( 0.00%) 20119.17 (179.92%)
disk-181 6933.32 ( 0.00%) 20509.91 (195.82%)
disk-241 6850.81 ( 0.00%) 20397.80 (197.74%)
disk-301 6815.22 ( 0.00%) 20287.58 (197.68%)
disk-361 7080.40 ( 0.00%) 20205.22 (185.37%)
disk-421 7076.13 ( 0.00%) 19957.33 (182.04%)
disk-481 7083.25 ( 0.00%) 19784.06 (179.31%)
disk-541 7038.39 ( 0.00%) 19610.92 (178.63%)
disk-601 7072.04 ( 0.00%) 19464.53 (175.23%)
disk-661 7010.97 ( 0.00%) 19348.23 (175.97%)
disk-721 7069.44 ( 0.00%) 19255.33 (172.37%)
disk-781 7007.58 ( 0.00%) 19103.14 (172.61%)
disk-841 6981.18 ( 0.00%) 18964.22 (171.65%)
disk-901 6968.47 ( 0.00%) 18826.72 (170.17%)
disk-961 6964.61 ( 0.00%) 18708.02 (168.62%)
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420573509-24774-7-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.
The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.
Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.
This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.
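The idea of replacing an on-stack pt_regs with static storage indexed by nesting level can be sketched as follows. This is a minimal userspace model, not the kernel code: the struct, the enum, and regs_for_context() are hypothetical names, the four nesting levels are an assumption, and a single array stands in for what would be per-CPU data.

```c
/* Hypothetical, reduced stand-in for pt_regs. */
struct regs_sketch {
	unsigned long ip;
	unsigned long sp;
};

/* Assumed nesting levels; one storage slot per level. */
enum ctx_sketch {
	CTX_TASK,
	CTX_SOFTIRQ,
	CTX_HARDIRQ,
	CTX_NMI,
	CTX_MAX
};

/* Static storage replaces a large on-stack structure.
 * In the kernel this would be per-CPU; here it models one CPU. */
static struct regs_sketch regs_storage[CTX_MAX];

/* Each nesting level gets its own slot, so an event raised from a
 * deeper context never overwrites the regs of the context it interrupted. */
struct regs_sketch *regs_for_context(enum ctx_sketch ctx)
{
	return &regs_storage[ctx];
}
```

Because nesting is well defined, a slot is only reused once the event that filled it has finished, which is what makes the static storage safe in this model.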
Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.
We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to ascertain which __perf_regs
element to use).
One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have two flavors of the MCS spinlock: standard and cancelable (OSQ).
While each one is independent of the other, we currently mix and match
them. This patch:
- Moves the OSQ code out of mcs_spinlock.h (which only deals with the traditional
version) into include/linux/osq_lock.h. No unnecessary code is added to the
more global header file; any locks that make use of OSQ must include
it anyway.
- Renames mcs_spinlock.c to osq_lock.c. This file only contains osq code.
- Introduces a CONFIG_LOCK_SPIN_ON_OWNER in order to only build osq_lock
if there is support for it.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1420573509-24774-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
... which is equivalent to the fastpath counterpart.
This mainly allows getting some WW specific code out
of generic mutex paths.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420573509-24774-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It serves much better if the comments are right before the osq_lock() call.
Also delete a useless comment.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420573509-24774-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark it so by renaming __mutex_lock_check_stamp().
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420573509-24774-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The original purpose of rq::skip_clock_update was to avoid 'costly' clock
updates for back to back wakeup-preempt pairs. The big problem with it
has always been that the rq variable is unaware of the context and
causes indiscriminate clock skips.
Rework the entire thing and create a sense of context by only allowing
schedule() to skip clock updates. (XXX can we measure the cost of the
added store?)
By ensuring only schedule can ever skip an update, we guarantee we're
never more than 1 tick behind on the update.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->clock{,_task} are serialized by rq->lock, verify this.
One immediate fail is the usage in scale_rt_capability, so 'annotate'
that for now, there's more 'funny' there. Maybe change rq->lock into a
raw_seqlock_t?
(Only 32-bit is affected)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Searching all uses of p->sched_class in sched/core.c shows that nothing checks
it before use, so it seems that every task must belong to one sched_class.
Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
[ Moved the early class assignment to make it boot. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1419835303-28958-1-git-send-email-yaodongdong@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Child has the same decay_count as parent. If it's not zero,
we add it to parent's cfs_rq->removed_load:
wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair().
The child's load is just garbage after copying from the parent;
it hasn't been on a cfs_rq yet, and it must not be added to
cfs_rq::removed_load in migrate_task_rq_fair().
The patch moves the sched_entity::avg::decay_count initialization
into sched_fork(), so migrate_task_rq_fair() does not change
removed_load.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
"struct task_struct"->state is "volatile long" and __ffs() warns that
"Undefined if no bit exists, so code should check against 0 first."
Therefore, at expression
state = p->state ? __ffs(p->state) + 1 : 0;
in sched_show_task(), CPU might see "p->state" before "?" as "non-zero"
but "p->state" after "?" as "zero", which could result in
"state >= sizeof(stat_nam)" being true and bogus '?' is printed.
This patch changes "state" from "unsigned int" to "unsigned long" and
saves "p->state" before calling __ffs(), in order to avoid a potential call
to __ffs(0).
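The snapshot pattern described above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel code: struct task_sketch is a hypothetical stand-in for task_struct, and ffs_sketch() uses the GCC/Clang builtin __builtin_ctzl() in place of the kernel's __ffs().

```c
/* Hypothetical stand-in for task_struct; only the field discussed above. */
struct task_sketch {
	volatile long state;
};

/* Stand-in for __ffs(): index of the lowest set bit.
 * Like __ffs(), the result is undefined when word == 0. */
unsigned long ffs_sketch(unsigned long word)
{
	return (unsigned long)__builtin_ctzl(word);
}

/* Read p->state exactly once, then test and use the same snapshot,
 * so the zero check and the bit search cannot see different values. */
unsigned long state_index(const struct task_sketch *p)
{
	unsigned long state = (unsigned long)p->state;

	if (state)
		state = ffs_sketch(state) + 1;
	return state;
}
```

With the value held in a local, a concurrent change to p->state can make the result stale, but it can no longer lead to a bit search on zero.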
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/201412052131.GCE35924.FVHFOtLOJOMQFS@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sometimes a "BUG: sleeping function called from invalid context"
message is not indicative of locking problems, but is the result
of a stack overflow corrupting the thread info.
Witness http://oss.sgi.com/archives/xfs/2014-02/msg00325.html
for example, which took a few go-rounds to sort out.
If we're printing the warning, things are wonky already, and
it'd be informative to check for the stack end corruption at this
point, too.
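A stack-end check of this kind can be modelled with a marker word that sits where the stack would overflow into. The sketch below is a userspace illustration only: the struct layout, the function names, and the marker value are assumptions chosen for the example.

```c
#include <stdbool.h>

/* Illustrative marker value; any unlikely bit pattern works for the model. */
#define STACK_END_MAGIC_SKETCH 0x57AC6E9DUL

/* Hypothetical model of a thread stack: the marker occupies the last
 * word that a downward-growing stack would reach before overflowing. */
struct stack_sketch {
	unsigned long end_marker;
	unsigned long frames[15];
};

/* Plant the marker when the stack is set up. */
void stack_sketch_init(struct stack_sketch *s)
{
	s->end_marker = STACK_END_MAGIC_SKETCH;
}

/* An overwritten marker means something ran past the end of the stack. */
bool stack_sketch_end_corrupted(const struct stack_sketch *s)
{
	return s->end_marker != STACK_END_MAGIC_SKETCH;
}
```

Running such a check at the point where the warning is printed costs one comparison and distinguishes an overflow from a genuine locking problem.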
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/5490B158.4060005@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In __synchronize_entity_decay(), if "decays" happens to be zero,
se->avg.decay_count will not be zeroed, holding the positive value
assigned when dequeued last time.
This is problematic in the following case:
If this runnable task is CFS-balanced to other CPUs soon afterwards,
migrate_task_rq_fair() will treat it as a blocked task due to its
non-zero decay_count, thereby adding its load to cfs_rq->removed_load
wrongly.
Thus, we must zero se->avg.decay_count in this case as well.
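The intended behaviour can be sketched with a reduced model. The struct and function names are hypothetical, and the computation of "decays" as the difference between a global decay counter and the saved decay_count is an assumption made for the illustration.

```c
/* Hypothetical, reduced stand-in for the load-tracking fields. */
struct avg_sketch {
	unsigned long decay_count;
};

/* Returns the number of decay periods still to be applied.
 * decay_count is cleared unconditionally, including when the
 * result is zero, so a later migration sees no stale value. */
unsigned long sync_decay_sketch(struct avg_sketch *avg,
				unsigned long decay_counter)
{
	unsigned long decays = decay_counter - avg->decay_count;

	avg->decay_count = 0;
	return decays;
}
```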
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418745509-2609-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pass the original kprobe when preparing the arch-dependent part of an
optimized kprobe, since some architectures (e.g. ARM32) require the
information in the original kprobe.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: group scheduling corner case fix, two deadline scheduler
fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs
system fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()
sched/deadline: Avoid double-accounting in case of missed deadlines
sched/deadline: Fix migration of SCHED_DEADLINE tasks
sched: Fix odd values in effective_load() calculations
sched, fanotify: Deal with nested sleeps
sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also some kernel side fixes: uncore PMU
driver fix, user regs sampling fix and an instruction decoder fix that
unbreaks PEBS precise sampling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes
perf/x86_64: Improve user regs sampling
perf: Move task_pt_regs sampling into arch code
x86: Fix off-by-one in instruction decoder
perf hists browser: Fix segfault when showing callchain
perf callchain: Free callchains when hist entries are deleted
perf hists: Fix children sort key behavior
perf diff: Fix to sort by baseline field by default
perf list: Fix --raw-dump option
perf probe: Fix crash in dwarf_getcfi_elf
perf probe: Fix to fall back to find probe point in symbols
perf callchain: Append callchains only when requested
perf ui/tui: Print backtrace symbols when segfault occurs
perf report: Show progress bar for output resorting
Pull locking fixes from Ingo Molnar:
"A liblockdep fix and a mutex_unlock() mutex-debugging fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
mutex: Always clear owner field upon mutex_unlock()
tools/liblockdep: Fix debug_check thinko in mutex destroy
Currently, rcutorture's Reader Batch checks measure from the end of
the previous grace period to the end of the current one. This commit
tightens up these checks by measuring from the start and end of the same
grace period. This involves adding rcu_batches_started() and friends
corresponding to the existing rcu_batches_completed() and friends.
We leave SRCU alone for the moment, as it does not yet have a way of
tracking both ends of its grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the return type of rcu_batches_completed() and friends matches
that of the rcu_torture_ops structure's ->completed field, the wrapper
functions can be deleted. This commit carries out that deletion, while
also wiring "sched"'s ->completed field to rcu_batches_completed_sched().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The counter returned by the various ->completed functions is subject to
overflow, which means that subtracting two such counters might result
in overflow, which invokes undefined behavior in the C standard. This
commit therefore changes these functions and variables to unsigned to
avoid this undefined behavior.
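A short illustration of why unsigned counters are safe to subtract: unsigned arithmetic in C is defined modulo ULONG_MAX + 1, so the difference of two snapshots stays correct across a wrap. The helper name counter_delta() is hypothetical.

```c
#include <limits.h>

/* Number of increments between two snapshots of a free-running
 * unsigned counter. Well defined even if the counter wrapped
 * between the snapshots, provided it wrapped at most once. */
unsigned long counter_delta(unsigned long older, unsigned long newer)
{
	return newer - older;
}
```

The same subtraction on signed long operands could overflow, which the C standard leaves undefined.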
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Long ago, the various ->completed fields were of type long, but now are
unsigned long due to signed-integer-overflow concerns. However, the
various _batches_completed() functions remained of type long, even though
their only purpose in life is to return the corresponding ->completed
field. This patch cleans this up by changing these functions' return
types to unsigned long.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cleanups
kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
kgdb/kdb: Allow access on a single core, if a CPU round up is deemed
impossible, which will allow inspection of the now "trashed" kernel
kdb: Add enable mask for the command groups
kdb: access controls to restrict sensitive commands
Merge tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kgdb/kdb fixes from Jason Wessel:
"These have been around since 3.17 and in kgdb-next for the last 9
weeks and some will go back to -stable.
Summary of changes:
Cleanups
- kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
- kgdb/kdb: Allow access on a single core, if a CPU round up is
deemed impossible, which will allow inspection of the now "trashed"
kernel
- kdb: Add enable mask for the command groups
- kdb: access controls to restrict sensitive commands"
* tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kernel/debug/debug_core.c: Logging clean-up
kgdb: timeout if secondary CPUs ignore the roundup
kdb: Allow access to sensitive commands to be restricted by default
kdb: Add enable mask for groups of commands
kdb: Categorize kdb commands (similar to SysRq categorization)
kdb: Remove KDB_REPEAT_NONE flag
kdb: Use KDB_REPEAT_* values as flags
kdb: Rename kdb_register_repeat() to kdb_register_flags()
kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
kdb: Remove currently unused kdbtab_t->cmd_flags
When applying multiple patches to a module, if the module is loaded
after the patches are loaded, the patches are applied in reverse order:
$ insmod patch1.ko
[ 43.172992] livepatch: enabling patch 'patch1'
$ insmod patch2.ko
[ 46.571563] livepatch: enabling patch 'patch2'
$ modprobe nfsd
[ 52.888922] livepatch: applying patch 'patch2' to loading module 'nfsd'
[ 52.899847] livepatch: applying patch 'patch1' to loading module 'nfsd'
Fix the loading order by storing the klp_patches list in queue order.
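The difference between stack order and queue order can be shown with a small singly linked list. This is a userspace sketch with hypothetical names, not the kernel's list API: add_head() models the old behaviour (newest entry first) and add_tail() models the queue order described above.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for a registered patch. */
struct patch_sketch {
	const char *name;
	struct patch_sketch *next;
};

struct patch_list_sketch {
	struct patch_sketch *head;
	struct patch_sketch *tail;
};

/* Stack order: the most recently added patch is visited first. */
void add_head(struct patch_list_sketch *l, struct patch_sketch *p)
{
	p->next = l->head;
	l->head = p;
	if (!l->tail)
		l->tail = p;
}

/* Queue order: patches are visited in the order they were added. */
void add_tail(struct patch_list_sketch *l, struct patch_sketch *p)
{
	p->next = NULL;
	if (l->tail)
		l->tail->next = p;
	else
		l->head = p;
	l->tail = p;
}
```

Walking a list built with add_tail() visits patch1 before patch2, matching the order in which the patches were enabled.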
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
CONFIG_GCOV_FORMAT_3_4 / _4_7 / _AUTODETECT are exclusive.
Compare the CC version only when _AUTODETECT is enabled.
This change should have no impact.
Signed-off-by: Masahiro Yamada <yamada.m@jp.panasonic.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Kbuild descends into the kernel/gcov/ directory only when
CONFIG_GCOV_KERNEL is enabled. (See kernel/Makefile)
The CONFIG_GCOV_KERNEL check can be omitted in kernel/gcov/Makefile.
Signed-off-by: Masahiro Yamada <yamada.m@jp.panasonic.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Since commit 371fdc77af (kbuild: collect shorthands into
scripts/Kbuild.include), scripts/Makefile.clean includes
scripts/Kbuild.include.
The workaround and the comment block in kernel/gcov/Makefile
are no longer necessary.
Signed-off-by: Masahiro Yamada <yamada.m@jp.panasonic.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
The macros cc-version, cc-fullversion and ld-version take no argument.
It is not necessary to add $(call ...) to invoke them.
Signed-off-by: Masahiro Yamada <yamada.m@jp.panasonic.com>
Acked-by: Helge Deller <deller@gmx.de> [parisc]
Signed-off-by: Michal Marek <mmarek@suse.cz>
Currently if DEBUG_MUTEXES is enabled, the mutex->owner field is
cleared only if debug_locks is active. This exposes a race to other users of
the field where the mutex->owner may be still set to a stale value,
potentially upsetting mutex_spin_on_owner() among others.
References: https://bugs.freedesktop.org/show_bug.cgi?id=87955
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420540175-30204-1-git-send-email-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The dl_runtime_exceeded() function is supposed to check if
a SCHED_DEADLINE task must be throttled, by checking if its
current runtime is <= 0. However, it also checks if the
scheduling deadline has been missed (the current time is
larger than the current scheduling deadline), further
decreasing the runtime if this happens.
This "double accounting" is wrong:
- In case of partitioned scheduling (or single CPU), this
happens if task_tick_dl() has been called later than expected
(due to small HZ values). In this case, the current runtime is
also negative, and replenish_dl_entity() can take care of the
deadline miss by recharging the current runtime to a value smaller
than dl_runtime.
- In case of global scheduling on multiple CPUs, scheduling
deadlines can be missed even if the task did not consume more
runtime than expected, hence penalizing the task is wrong.
This patch fixes this problem by throttling a SCHED_DEADLINE task
only when its runtime becomes negative, and not modifying the runtime.
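The resulting check can be sketched as a pure predicate on the remaining budget. This is a simplified model with hypothetical names: the struct is not the kernel's scheduling entity, and the "<= 0" comparison follows the description of the check at the top of this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for a SCHED_DEADLINE entity. */
struct dl_entity_sketch {
	int64_t runtime;	/* remaining budget, in ns */
	uint64_t deadline;	/* absolute scheduling deadline, in ns */
};

/* Throttle only when the budget is exhausted. The deadline is not
 * consulted and the runtime is not modified, so a missed deadline
 * is never charged to the task a second time. */
bool runtime_exceeded_sketch(const struct dl_entity_sketch *dl_se)
{
	return dl_se->runtime <= 0;
}
```

Taking a const pointer makes the "does not modify the runtime" property visible in the signature of the sketch.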
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
According to global EDF, tasks should be migrated between runqueues
without checking if their scheduling deadlines and runtimes are valid.
However, SCHED_DEADLINE currently performs such a check:
a migration happens doing:
deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, later_rq->cpu);
activate_task(later_rq, next_task, 0);
which ends up calling dequeue_task_dl(), setting the new CPU, and then
calling enqueue_task_dl().
enqueue_task_dl() then calls enqueue_dl_entity(), which calls
update_dl_entity(), which can modify scheduling deadline and runtime,
breaking global EDF scheduling.
As a result, some of the properties of global EDF are not respected:
for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
two cores can have unbounded response times for the third task even
if 30/80+40/80+120/170 = 1.5809 < 2.
This can be fixed by invoking update_dl_entity() only in case of
wakeup, or if this is a new SCHED_DEADLINE task.
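The utilization figure quoted for the example taskset can be reproduced with a few lines of C. The struct and function names are hypothetical; each task is read as a (runtime, period) pair, as in the taskset above.

```c
/* Hypothetical (runtime, period) pair, in the same time unit. */
struct dl_task_sketch {
	double runtime;
	double period;
};

/* Sum of runtime/period over all tasks. */
double total_utilization(const struct dl_task_sketch *tasks, int n)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += tasks[i].runtime / tasks[i].period;
	return sum;
}
```

For {(30, 80), (40, 80), (120, 170)} this gives 0.375 + 0.5 + 0.7059, i.e. about 1.5809, which is below the capacity of two cores.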
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In effective_load, we have (long w * unsigned long tg->shares) / long W,
when w is negative, it is cast to unsigned long and hence the product is
insanely large. Fix this by casting tg->shares to long.
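The conversion problem is easy to reproduce in isolation. In the sketch below (function names are hypothetical), the first function mixes long and unsigned long as described, so a negative w is converted to a huge unsigned value before the multiplication; the second casts the unsigned operand to long first.

```c
/* Mixed signedness: w and W are converted to unsigned long by the
 * usual arithmetic conversions, so a negative w becomes huge. */
long scaled_buggy(long w, unsigned long shares, long W)
{
	return (long)((w * shares) / W);
}

/* Casting shares to long keeps the whole expression signed. */
long scaled_fixed(long w, unsigned long shares, long W)
{
	return (w * (long)shares) / W;
}
```

With w = -2, shares = 1024 and W = 4, the signed version yields -512, while the mixed version yields a large positive number.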
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86_64, at least, task_pt_regs may be only partially initialized
in many contexts, so x86_64 should not use it without extra care
from interrupt context, let alone NMI context.
This will allow x86_64 to override the logic and will supply some
scratch space to use to make a cleaner copy of user regs.
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: chenggang.qcg@taobao.com
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We abort the build when gcc doesn't support fentry on x86_64
(regs->ip modification can't really reliably work with mcount).
This however breaks allmodconfig for people with older gccs that don't
support -mfentry.
Turn the build-time failure into runtime failure, resulting in the whole
infrastructure not being initialized if CC_USING_FENTRY is unset.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE and
both checks can fail if we race with an EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
change in between, since gcc needs to reload p->exit_state after
security_task_wait(). In this case ->notask_error will be wrongly
cleared and do_wait() can hang forever if it was the last eligible
child.
Many thanks to Arne who carefully investigated the problem.
Note: this bug is very old but it was purely theoretical until commit
b3ab03160d ("wait: completely ignore the EXIT_DEAD tasks"). Before
this commit "-O2" was probably enough to guarantee that the compiler won't
read ->exit_state twice.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Arne Goedeke <el@laramies.com>
Tested-by: Arne Goedeke <el@laramies.com>
Cc: <stable@vger.kernel.org> [3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Verify that the frequency value from userspace is valid and makes sense.
Unverified values can cause overflows later on.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Fix up bug for negative values and drop redundant cap check]
Signed-off-by: John Stultz <john.stultz@linaro.org>
An unvalidated user input is multiplied by a constant, which can result in
undefined behaviour for large values. While this is validated later,
we should avoid triggering undefined behaviour.
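Validating before multiplying can be sketched as follows. This is a userspace illustration with hypothetical names and constants; the accepted range for the microseconds field is an assumption for the example, chosen so that the product always fits in a long.

```c
#include <stdbool.h>

#define USEC_PER_SEC_SKETCH	1000000L
#define NSEC_PER_USEC_SKETCH	1000L

/* Hypothetical stand-in for a user-supplied time value. */
struct tv_sketch {
	long tv_sec;
	long tv_usec;
};

/* Reject negative seconds and out-of-range microseconds. */
bool tv_sketch_valid(const struct tv_sketch *tv)
{
	if (tv->tv_sec < 0)
		return false;
	return tv->tv_usec >= 0 && tv->tv_usec < USEC_PER_SEC_SKETCH;
}

/* Returns 0 and stores the nanoseconds on success, -1 on invalid
 * input. The multiplication only runs on a validated value, so the
 * product is at most 999999000 and cannot overflow a long. */
int usec_to_nsec_checked(const struct tv_sketch *tv, long *nsec)
{
	if (!tv_sketch_valid(tv))
		return -1;
	*nsec = tv->tv_usec * NSEC_PER_USEC_SKETCH;
	return 0;
}
```

Checking first means the signed multiplication is never evaluated on a value that could overflow, rather than relying on a later check to discard the result.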
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: include trivial millisecond->microsecond correction noticed
by Andy]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Subtle race conditions can result if a CPU stays in dyntick-idle mode
long enough for the ->gpnum and ->completed fields to wrap. For
example, consider the following sequence of events:
o CPU 1 encounters a quiescent state while waiting for grace period
5 to complete, but then enters dyntick-idle mode.
o While CPU 1 is in dyntick-idle mode, the grace-period counters
wrap around so that the grace period number is now 4.
o Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
and grace period 5 begins.
o The quiescent state that CPU 1 passed through during the old
grace period 5 looks like it applies to the new grace period
5. Therefore, the new grace period 5 completes without CPU 1
having passed through a quiescent state.
This could clearly be a fatal surprise to any long-running RCU read-side
critical section that happened to be running on CPU 1 at the time. At one
time, this was not a problem, given that it takes significant time for
the grace-period counters to overflow even on 32-bit systems. However,
with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
idle periods are now becoming quite feasible. It is therefore time to
close this race.
This commit therefore avoids this race condition by having the
quiescent-state forcing code detect when a CPU is falling too far
behind, and setting a new rcu_data field ->gpwrap when this happens.
Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
fields are known to be untrustworthy, and can be ignored, along with
any associated quiescent states.
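The "falling too far behind" test can be sketched with a wrap-safe comparison on unsigned counters. Everything below is a model under stated assumptions: the struct and function names are hypothetical, the comparison macro is defined locally for the example, and the ULONG_MAX / 4 threshold is an illustrative choice rather than a documented value.

```c
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "a is before b" for free-running unsigned long counters. */
#define ULONG_CMP_LT_SKETCH(a, b)	(ULONG_MAX / 2 < (a) - (b))

/* Hypothetical, reduced per-CPU view of the grace-period state. */
struct cpu_gp_sketch {
	unsigned long gpnum;	/* last grace-period number this CPU saw */
	bool gpwrap;		/* set when gpnum can no longer be trusted */
};

/* Called from the forcing path in this model: if the global counter
 * has moved more than the threshold past the CPU's snapshot, mark
 * the snapshot as untrustworthy before it can alias a newer number. */
void check_gp_lag(struct cpu_gp_sketch *cpu, unsigned long global_gpnum)
{
	if (ULONG_CMP_LT_SKETCH(cpu->gpnum + ULONG_MAX / 4, global_gpnum))
		cpu->gpwrap = true;
}
```

Setting the flag well before a full wrap is the point of the design: once the counters have wrapped completely, the stale and current numbers are indistinguishable by value alone.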
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current RCU CPU stall warning code will print "Stall ended before
state dump start" any time that the stall-warning code is triggered on
a CPU that has already reported a quiescent state for the current grace
period and if all quiescent states have been reported for the current
grace period. However, a true stall can result in these symptoms, for
example, by preventing RCU's grace-period kthreads from ever running.
This commit therefore checks for this condition, reporting the end of
the stall only if one of the grace-period counters has actually advanced.
Otherwise, it reports the last time that the grace-period kthread made
meaningful progress. (In normal situations, the grace-period kthread
should make meaningful progress at least every jiffies_till_next_fqs
jiffies.)
Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Miroslav Benes <mbenes@suse.cz>
One way that an RCU CPU stall warning can happen is if the grace-period
kthread is not allowed to execute. One proxy for this kthread's
forward progress is the number of force-quiescent-state (fqs) scans.
This commit therefore adds the number of fqs scans to the RCU CPU stall
warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
SRCU does not need to be compiled by default in all cases. For tinification
efforts, not compiling SRCU unless necessary is desirable.
The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.
If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
text data bss dec hex filename
2007 0 0 2007 7d7 kernel/rcu/srcu.o
Size of arch/powerpc/boot/zImage changes from
text data bss dec hex filename
831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after
so the savings are about 2000 bytes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
When rcutorture used only the low-order 32 bits of the grace-period
number, it was not a problem for SRCU to use a 32-bit completed field.
However, rcutorture now uses the full 64 bits on 64-bit systems, so
this commit converts SRCU's ->completed field to unsigned long so as to
provide 64 bits on 64-bit systems.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The RCU callback lists are initialized in both rcu_boot_init_percpu_data()
and rcu_init_percpu_data(). The former is intended for initializing
immutable data, so this commit removes the initialization from
rcu_boot_init_percpu_data() and leaves it in rcu_init_percpu_data().
This change prepares for permitting callbacks to be queued very early
in boot.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that blocked tasks are no longer migrated to the root rcu_node
structure, there is no need to scan the root rcu_node structure for
blocked tasks stalling the current grace period. This commit therefore
removes this scan.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The patch dfeb9765ce ("Allow post-unlock reference for rt_mutex")
made rcu-boost safe even when the rt_mutex has a post-unlock reference.
But an rt_mutex allowing a post-unlock reference is definitely a bug, and it
was fixed by commit 27e35715df ("rtmutex: Plug slow unlock race").
This fix made the previous patch (dfeb9765ce) useless.
Even worse, the priority inversion introduced by the previous
patch still exists.
rcu_read_unlock_special() {
	rt_mutex_unlock(&rnp->boost_mtx);
	/* Priority-Inversion:
	 * the current task has been deboosted and may be preempted
	 * immediately as a low-priority task. It could wait a long time
	 * before being scheduled back in, while the rcu-booster waits on
	 * this low-priority task and sleeps.
	 * This priority inversion prevents the rcu-booster from working
	 * as expected.
	 */
	complete(&rnp->boost_completion);
}
Just revert the patch to avoid it.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_cleanup_dead_cpu() function (called after a CPU has gone
completely offline) has not reported a quiescent state because there
was probably at least one synchronize_rcu() between the time the CPU
went offline and the CPU_DEAD notifier, and this would have detected
the CPU's offline state via quiescent-state forcing. However, the plan
is for CPUs to take themselves offline, at which point it makes sense
for them to report their own quiescent state. This commit makes this
change in preparation for the new CPU-hotplug setup.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
rcu_node structure are now offline, it affinities the corresponding
RCU-boost ("rcub") kthread away from those CPUs. This is pointless
because the kthread cannot run on those offline CPUs in any case.
This commit therefore removes this unneeded code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because there are no longer any preempted tasks on the root rcu_node, and
because there is no longer ever an rcub kthread for the root rcu_node,
this commit drops the code in force_qs_rnp() that attempts to awaken
the non-existent root rcub kthread. This is strictly a performance
enhancement, removing a root rcu_node ->lock acquisition and release
along with some tests in rcu_initiate_boost(), ending with the test that
notes that there is no rcub kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that offlining CPUs no longer moves leaf rcu_node structures'
->blkd_tasks lists to the root, there is no way for the root rcu_node
structure's ->blkd_tasks list to be nonempty, unless the root node is also
the sole leaf node. This commit therefore refrains from creating an rcub
kthread for the root rcu_node structure unless it is also the sole leaf.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Given that there is now an rcu_preempt_has_tasks() function that checks
to see if the ->blkd_tasks list is non-empty, this commit makes use of it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that we are not migrating callbacks, there is no need to hold the
->orphan_lock across the ->qsmaskinit bit-clearing process.
This commit therefore releases ->orphan_lock immediately after adopting
the orphaned RCU callbacks.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When the last CPU associated with a given leaf rcu_node structure
goes offline, something must be done about the tasks queued on that
rcu_node structure. Each of these tasks has been preempted on one of
the leaf rcu_node structure's CPUs while in an RCU read-side critical
section that it has not yet exited. Handling these tasks is the job of
rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
structure to the root rcu_node structure.
Unfortunately, this migration has to be done one task at a time because
each task's allegiance must be shifted from the original leaf rcu_node to
the root, so that future attempts to deal with these tasks will acquire
the root rcu_node structure's ->lock rather than that of the leaf.
Worse yet, this migration must be done with interrupts disabled, which
is not so good for realtime response, especially given that there is
no bound on the number of tasks on a given rcu_node structure's list.
(OK, OK, there is a bound, it is just that it is unreasonably large,
especially on 64-bit systems.) This was not considered a problem back
when rcu_preempt_offline_tasks() was first written because realtime
systems were assumed not to do CPU-hotplug operations while real-time
applications were running. This assumption has proved of dubious validity
given that people are starting to run multiple realtime applications
on a single SMP system and that it is common practice to offline then
online a CPU before starting its real-time application in order to clear
extraneous processing off of that CPU. So we now need CPU hotplug
operations to avoid undue latencies.
This commit therefore avoids migrating these tasks, instead letting
them be dequeued one by one from the original leaf rcu_node structure
by rcu_read_unlock_special(). This means that the clearing of bits
from the upper-level rcu_node structures must be deferred until the
last such task has been dequeued, because otherwise subsequent grace
periods won't wait on them. This commit has the beneficial side effect
of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
CONFIG_RCU_BOOST builds.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
bit clearing up the rcu_node tree once a given rcu_node structure's
->blkd_tasks list becomes empty. This is the final commit in preparation
for the rework of RCU priority boosting: It enables preempted tasks to
remain queued on their rcu_node structure even after all of that rcu_node
structure's CPUs have gone offline.
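The deferred bit clearing described above can be modeled roughly as follows.
This is a minimal Python sketch, not the kernel code: the RcuNode class and
the helper names are hypothetical, and locking, grace-period state, and the
->qsmask fields are all left out.

```python
class RcuNode:
    """Hypothetical stand-in for one rcu_node structure."""
    def __init__(self, parent=None, grpmask=0):
        self.parent = parent
        self.grpmask = grpmask    # this node's bit in parent.qsmaskinit
        self.qsmaskinit = 0       # bits for online CPUs / live children
        self.blkd_tasks = []      # preempted readers queued on this node


def cleanup_dead_rnp(rnp):
    # Walk upward, clearing this node's bit in its parent, but only while
    # the node has no online CPUs left and no blocked tasks queued.
    while rnp.parent is not None and rnp.qsmaskinit == 0 and not rnp.blkd_tasks:
        rnp.parent.qsmaskinit &= ~rnp.grpmask
        rnp = rnp.parent


def cpu_offline(leaf, cpu_bit):
    leaf.qsmaskinit &= ~cpu_bit
    cleanup_dead_rnp(leaf)        # does nothing if tasks are still queued


def read_unlock_special(leaf, task):
    leaf.blkd_tasks.remove(task)  # the task dequeues itself from its leaf
    cleanup_dead_rnp(leaf)        # the last task out propagates the clearing
```

In this model, a leaf whose last CPU goes offline keeps its bit set in the
parent until the final blocked task leaves its read-side critical section.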
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
in preparation for the rework of RCU priority boosting. This new function
will be invoked from rcu_read_unlock_special() in the reworked scheme,
which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
structure's ->qsmaskinit field has already been updated.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit undertakes a simple variable renaming to make way for
some rework of RCU priority boosting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit prevents random compiler optimizations by applying
ACCESS_ONCE() to lockless accesses.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
removed the irq_work_queue() call, so TREE_RCU no longer needs
irq work. This commit therefore updates RCU's Kconfig and
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU. However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another. It turns out
that recent versions of rcutorture can spot this race.
This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and decremented after it is invoked. Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.
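The race and the single-count fix can be illustrated with a small
single-threaded Python model. This is only a sketch under stated
assumptions: the NoCBsCPU class and its method names are hypothetical, and
the "in flight" window is simulated by detaching a list before re-attaching
its contents elsewhere.

```python
from collections import deque


class NoCBsCPU:
    """Hypothetical stand-in for one no-CBs CPU's callback state."""
    def __init__(self):
        self.lists = [deque(), deque(), deque()]  # three lists, as in the text
        self.q_count = 0                          # single consolidated count

    def post(self, cb):
        self.q_count += 1          # count first ...
        self.lists[0].append(cb)   # ... then publish the callback

    def detach(self, idx):
        # Take everything off one list, as a mover would before re-queuing.
        moved = list(self.lists[idx])
        self.lists[idx].clear()
        return moved

    def attach(self, idx, cbs):
        self.lists[idx].extend(cbs)

    def invoke_all(self):
        for cb in self.detach(2):
            cb()
            self.q_count -= 1      # decrement only after the callback ran

    def lists_look_empty(self):    # the old, racy check
        return not any(self.lists)

    def has_callbacks(self):       # the check based on the single count
        return self.q_count != 0


cpu = NoCBsCPU()
ran = []
cpu.post(lambda: ran.append("cb"))
in_flight = cpu.detach(0)                 # callback is between lists
old_check_missed_it = cpu.lists_look_empty()
new_check_saw_it = cpu.has_callbacks()
cpu.attach(2, in_flight)
cpu.invoke_all()
```

While the callback is in flight, all three lists are empty, yet the count
still reports it.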
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit b2c4623dcd ("rcu: More on deadlock between CPU hotplug and expedited
grace periods") introduced another problem that can easily be reproduced by
starting/stopping cpus in a loop.
E.g.:
for i in `seq 5000`; do
	echo 1 > /sys/devices/system/cpu/cpu1/online
	echo 0 > /sys/devices/system/cpu/cpu1/online
done
Will result in:
INFO: task /cpu_start_stop:1 blocked for more than 120 seconds.
Call Trace:
([<00000000006a028e>] __schedule+0x406/0x91c)
[<0000000000130f60>] cpu_hotplug_begin+0xd0/0xd4
[<0000000000130ff6>] _cpu_up+0x3e/0x1c4
[<0000000000131232>] cpu_up+0xb6/0xd4
[<00000000004a5720>] device_online+0x80/0xc0
[<00000000004a57f0>] online_store+0x90/0xb0
...
And a deadlock.
Problem is that if the last ref in put_online_cpus() can't get the
cpu_hotplug.lock the puts_pending count is incremented, but a sleeping
active_writer might never be woken up, therefore never exiting the loop in
cpu_hotplug_begin().
This fix removes puts_pending and turns refcount into an atomic variable. We
also introduce a wait queue for the active_writer, to avoid possible races and
use-after-free. There is no need to take the lock in put_online_cpus() anymore.
Can't reproduce it with this fix.
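The shape of the fix (a reference count plus a wait queue on which the
writer sleeps) can be sketched in Python as follows. This is an
illustrative model only: the HotplugRefcount class is hypothetical, Python
has no lock-free atomic integer so the count is guarded by the condition
variable's lock, and the exclusion of new readers while a writer is active
is not modeled.

```python
import threading


class HotplugRefcount:
    """Hypothetical model: readers hold references, the writer waits for zero."""
    def __init__(self):
        self._cond = threading.Condition()
        self._refcount = 0

    def get_online_cpus(self):
        with self._cond:
            self._refcount += 1

    def put_online_cpus(self):
        with self._cond:
            self._refcount -= 1
            if self._refcount == 0:
                # The last reference always wakes a waiting writer, so the
                # writer cannot sleep forever on a missed wakeup.
                self._cond.notify_all()

    def cpu_hotplug_begin(self, timeout=None):
        with self._cond:
            # Returns True once the count is zero, False if the wait timed out.
            return self._cond.wait_for(lambda: self._refcount == 0, timeout)


hp = HotplugRefcount()
hp.get_online_cpus()
blocked = hp.cpu_hotplug_begin(timeout=0.05)   # a reader is still active
result = []
writer = threading.Thread(
    target=lambda: result.append(hp.cpu_hotplug_begin(timeout=5)))
writer.start()
hp.put_online_cpus()                           # last put wakes the writer
writer.join()
```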
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
For RCU in UP, context-switch = QS = GP, thus we can force a
context-switch when any call_rcu_[bh|sched]() is invoked from the idle task.
After doing so, rcu_idle/irq_enter/exit() are useless, so we can simply
make these functions empty.
More importantly, this change does not alter the functionality logically.
Note: raise_softirq(RCU_SOFTIRQ)/rcu_sched_qs() in rcu_idle_enter() and
the outermost rcu_irq_exit() will have to wake up ksoftirqd
(due to in_interrupt() == 0).
Before this patch                   After this patch:
call_rcu_sched() in idle;           call_rcu_sched() in idle
                                      set resched
do other stuffs;                    do other stuffs
outmost rcu_irq_exit()              outmost rcu_irq_exit() (empty function)
  (or rcu_idle_enter())               (or rcu_idle_enter(), also empty function)
                                    start to resched. (see above)
  rcu_sched_qs()                    rcu_sched_qs()
    QS,and GP and advance cb          QS,and GP and advance cb
    wake up the ksoftirqd             wake up the ksoftirqd
      set resched
resched to ksoftirqd (or other)     resched to ksoftirqd (or other)
These two code paths are almost the same.
Size change after patching:
size kernel/rcu/tiny-old.o kernel/rcu/tiny-patched.o
   text    data     bss     dec     hex filename
   3449     206       8    3663     e4f kernel/rcu/tiny-old.o
   2406     144       8    2558     9fe kernel/rcu/tiny-patched.o
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull audit fix from Paul Moore:
"One audit patch to resolve a panic/oops when recording filenames in
the audit log, see the mail archive link below.
The fix isn't as nice as I would like, as it involves an allocate/copy
of the filename, but it solves the problem and the overhead should
only affect users who have configured audit rules involving file
names.
We'll revisit this issue with future kernels in an attempt to make
this suck less, but in the meantime I think this fix should go into
the next release of v3.19-rcX.
[ https://marc.info/?t=141986927600001&r=1&w=2 ]"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: create private file name copies when auditing inodes
Despite what the comment says, it is only softirqs that are disabled,
not interrupts. This commit therefore fixes the comment.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Let's start by assuming that something in the idle loop posts a callback,
and a scheduling-clock interrupt occurs:
1. The system is idle and stays that way, no runnable tasks.
2. Scheduling-clock interrupt occurs, rcu_check_callbacks() is called
as a result, which in turn calls rcu_is_cpu_rrupt_from_idle().
3. rcu_is_cpu_rrupt_from_idle() reports the CPU was interrupted from
idle, which results in an rcu_sched_qs() call, which does a
raise_softirq(RCU_SOFTIRQ).
4. Upon return from interrupt, rcu_irq_exit() is invoked, which calls
rcu_idle_enter_common(), which in turn calls rcu_sched_qs() again,
which does another raise_softirq(RCU_SOFTIRQ).
5. The softirq happens shortly and invokes rcu_process_callbacks(),
which invokes __rcu_process_callbacks().
6. So now callbacks can be invoked. At least they can be if
->donetail has been updated. Which it will have been because
rcu_sched_qs() invokes rcu_qsctr_help().
In the described scenario, rcu_sched_qs() and raise_softirq(RCU_SOFTIRQ)
get called twice, in steps 3 and 4. This redundancy could be eliminated
by removing the rcu_is_cpu_rrupt_from_idle() function.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The x86 architecture has multiple types of NMI-like interrupts: real
NMIs, machine checks, and, for some values of NMI-like, debugging
and breakpoint interrupts. These interrupts can nest inside each
other. Andy Lutomirski is adding RCU support to these interrupts,
so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
This commit therefore introduces nesting, using a clever NMI-coordination
algorithm suggested by Andy. The trick is to atomically increment
->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
(and, accordingly, after on exit). In addition, ->dynticks_nmi_nesting
is incremented by one if ->dynticks was incremented and by two otherwise.
This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
to one, it knows that ->dynticks must be atomically incremented.
This NMI-coordination algorithm has been validated by the following
Promela model:
------------------------------------------------------------------------
/*
* Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
* that allows nesting.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, you can access it online at
* http://www.gnu.org/licenses/gpl-2.0.html.
*
* Copyright IBM Corporation, 2014
*
* Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*/
#define BUSY_INCBY 2 /* Nesting increment when ->dynticks was already odd. */
byte dynticks_nmi_nesting = 0;
byte dynticks = 0;
/*
* Promela version of rcu_nmi_enter().
*/
inline rcu_nmi_enter()
{
	byte incby;
	byte tmp;
	incby = BUSY_INCBY;
	assert(dynticks_nmi_nesting >= 0);
	if
	:: (dynticks & 1) == 0 ->
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 1);
		incby = 1;
	:: else ->
		skip;
	fi;
	tmp = dynticks_nmi_nesting;
	tmp = tmp + incby;
	dynticks_nmi_nesting = tmp;
	assert(dynticks_nmi_nesting >= 1);
}
/*
* Promela version of rcu_nmi_exit().
*/
inline rcu_nmi_exit()
{
	byte tmp;
	assert(dynticks_nmi_nesting > 0);
	assert((dynticks & 1) != 0);
	if
	:: dynticks_nmi_nesting != 1 ->
		tmp = dynticks_nmi_nesting;
		tmp = tmp - BUSY_INCBY;
		dynticks_nmi_nesting = tmp;
	:: else ->
		dynticks_nmi_nesting = 0;
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 0);
	fi;
}
/*
* Base-level NMI runs non-atomically. Crudely emulates process-level
* dynticks-idle entry/exit.
*/
proctype base_NMI()
{
	byte busy;
	busy = 0;
	do
	::	/* Emulate base-level dynticks and not. */
		if
		:: 1 ->	atomic {
				dynticks = dynticks + 1;
			}
			busy = 1;
		:: 1 ->	skip;
		fi;
		/* Verify that we only sometimes have base-level dynticks. */
		if
		:: busy == 0 -> skip;
		:: busy == 1 -> skip;
		fi;
		/* Model RCU's NMI entry and exit actions. */
		rcu_nmi_enter();
		assert((dynticks & 1) == 1);
		rcu_nmi_exit();
		/* Emulate re-entering base-level dynticks and not. */
		if
		:: !busy -> skip;
		:: busy ->
			atomic {
				dynticks = dynticks + 1;
			}
			busy = 0;
		fi;
		/* We had better now be in dyntick-idle mode. */
		assert((dynticks & 1) == 0);
	od;
}
/*
* Nested NMI runs atomically to emulate interrupting base_NMI().
*/
proctype nested_NMI()
{
	do
	::	/*
		 * Use an atomic section to model a nested NMI. This is
		 * guaranteed to interleave into base_NMI() between a pair
		 * of base_NMI() statements, just as a nested NMI would.
		 */
		atomic {
			/* Verify that we only sometimes are in dynticks. */
			if
			:: (dynticks & 1) == 0 -> skip;
			:: (dynticks & 1) == 1 -> skip;
			fi;
			/* Model RCU's NMI entry and exit actions. */
			rcu_nmi_enter();
			assert((dynticks & 1) == 1);
			rcu_nmi_exit();
		}
	od;
}
init {
	run base_NMI();
	run nested_NMI();
}
------------------------------------------------------------------------
The following script can be used to run this model if placed in
rcu_nmi.spin:
------------------------------------------------------------------------
if ! spin -a rcu_nmi.spin
then
echo Spin errors!!!
exit 1
fi
if ! cc -DSAFETY -o pan pan.c
then
echo Compilation errors!!!
exit 1
fi
./pan -m100000
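
For readers without Spin at hand, the entry and exit rules described above
can also be traced by hand with a small Python model. This is a sequential
sketch only, so it checks the bookkeeping but not the interleavings that the
Promela model explores. The DyntickState class is hypothetical, and the
BUSY_INCBY value of 2 is taken from the "by two otherwise" wording above.

```python
BUSY_INCBY = 2  # assumed from the description: "by two otherwise"


class DyntickState:
    """Hypothetical per-CPU state: odd dynticks means RCU is watching."""
    def __init__(self, dynticks=0):
        self.dynticks = dynticks
        self.nmi_nesting = 0

    def nmi_enter(self):
        incby = BUSY_INCBY
        if (self.dynticks & 1) == 0:
            self.dynticks += 1      # leave dyntick-idle first ...
            incby = 1               # ... and remember that this level did so
        self.nmi_nesting += incby

    def nmi_exit(self):
        assert self.nmi_nesting > 0
        assert (self.dynticks & 1) == 1
        if self.nmi_nesting != 1:
            self.nmi_nesting -= BUSY_INCBY   # nested or busy: just unwind
            return
        self.nmi_nesting = 0
        self.dynticks += 1          # outermost NMI from idle: re-enter idle
```

Starting from idle, an NMI with one nested NMI takes the nesting count
through 1, 3, 1, 0 and leaves ->dynticks even again. Starting from a busy
CPU, the count goes 2, 0 and ->dynticks is never touched.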
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The current timecounter implementation will drop a variable amount
of resolution, depending on the magnitude of the time delta. In
other words, reading the clock too often or too close to a time
stamp conversion will introduce errors into the time values. This
patch fixes the issue by introducing a fractional nanosecond field
that accumulates the low order bits.
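The effect of carrying the low-order bits forward can be seen in a small
Python sketch. The function names and the mult/shift values are
illustrative assumptions, chosen so that one cycle is worth 0.75 ns. They
are not taken from any driver.

```python
def cyc2ns_truncating(cycles, mult, shift):
    # Old behavior: the bits shifted out are simply lost on every call.
    return (cycles * mult) >> shift


def cyc2ns_with_frac(cycles, mult, shift, frac):
    # New behavior: add the saved fraction, then save the new low bits.
    mask = (1 << shift) - 1
    scaled = cycles * mult + frac
    return scaled >> shift, scaled & mask


MULT, SHIFT = 3, 2      # illustrative: 3/4 ns per cycle
old_total = 0
new_total = 0
frac = 0
for _ in range(4):      # read the clock four times, one cycle apart
    old_total += cyc2ns_truncating(1, MULT, SHIFT)
    ns, frac = cyc2ns_with_frac(1, MULT, SHIFT, frac)
    new_total += ns
```

With these values, four one-cycle reads add up to 0 ns when truncating each
time, but to 3 ns when the fraction is accumulated, which matches a single
conversion of four cycles.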
Reported-by: Janusz Użycki <j.uzycki@elproma.com.pl>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The timecounter code has almost nothing to do with the clocksource
code. Let it live in its own file. This will help isolate the
timecounter users from the clocksource users in the source tree.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen.
2) Fix receive checksum handling in enic driver, from Govindarajulu
Varadarajan.
3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from
Herbert Xu. Also, add code to detect drivers that have this mistake
in the future.
4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai.
5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP
input path, from Nicolas Dichtel.
6) Fix MPLS action validation in openvswitch, from Pravin B Shelar.
7) Fix double SKB free in vxlan driver, also from Pravin.
8) When we scrub a packet, which happens when we are switching the
context of the packet (namespace, etc.), we should reset the
secmark. From Thomas Graf.
9) ->ndo_gso_check() needs to do more than return true/false, it also
has to allow the driver to clear netdev feature bits in order for
the caller to be able to proceed properly. From Jesse Gross.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
genetlink: A genl_bind() to an out-of-range multicast group should not WARN().
netlink/genetlink: pass network namespace to bind/unbind
ne2k-pci: Add pci_disable_device in error handling
bonding: change error message to debug message in __bond_release_one()
genetlink: pass multicast bind/unbind to families
netlink: call unbind when releasing socket
netlink: update listeners directly when removing socket
genetlink: pass only network namespace to genl_has_listeners()
netlink: rename netlink_unbind() to netlink_undo_bind()
net: Generalize ndo_gso_check to ndo_features_check
net: incorrect use of init_completion fixup
neigh: remove next ptr from struct neigh_table
net: xilinx: Remove unnecessary temac_property in the driver
net: phy: micrel: use generic config_init for KSZ8021/KSZ8031
net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
openvswitch: fix odd_ptr_err.cocci warnings
Bluetooth: Fix accepting connections when not using mgmt
Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR
brcmfmac: Do not crash if platform data is not populated
ipw2200: select CFG80211_WEXT
...
Unfortunately, while commit 4a928436 ("audit: correctly record file
names with different path name types") fixed a problem where we were
not recording filenames, it created a new problem by attempting to use
these file names after they had been freed. This patch resolves the
issue by creating a copy of the filename which the audit subsystem
frees after it is done with the string.
At some point it would be nice to resolve this issue with refcounts,
or something similar, instead of having to allocate/copy strings, but
that is almost surely beyond the scope of a -rcX patch so we'll defer
that for later. On the plus side, only audit users should be impacted
by the string copying.
Reported-by: Toralf Foerster <toralf.foerster@gmx.de>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Netlink families can exist in multiple namespaces, and for the most
part multicast subscriptions are per network namespace. Thus it only
makes sense to have bind/unbind notifications per network namespace.
To achieve this, pass the network namespace of a given client socket
to the bind/unbind functions.
Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
This isn't really a problem if it is accepted since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected (it's also
possible to silently (without telling the family) accept it, but it
would also have to be ignored on unbind so families that take any
kind of action on bind/unbind won't do unnecessary work for invalid
clients like that).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull audit fixes from Paul Moore:
"Four patches to fix various problems with the audit subsystem, all are
fairly small and straightforward.
One patch fixes a problem where we weren't using the correct gfp
allocation flags (GFP_KERNEL regardless of context, oops), one patch
fixes a problem with old userspace tools (this was broken for a
while), one patch fixes a problem where we weren't recording pathnames
correctly, and one fixes a problem with PID based filters.
In general I don't think there is anything controversial with this
patchset, and it fixes some rather unfortunate bugs; the allocation
flag one can be particularly scary looking for users"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: restore AUDIT_LOGINUID unset ABI
audit: correctly record file names with different path name types
audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb
audit: don't attempt to lookup PIDs when changing PID filtering audit rules
A regression was caused by commit 780a7654cee8:
audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)
When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.
This broke userspace by not returning the same information that was sent and
expected.
The rule:
auditctl -a exit,never -F auid=-1
gives:
auditctl -l
LIST_RULES: exit,never f24=0 syscall=all
when it should give:
LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all
Tag it so that it is reported the same way it was set. Create a new
private flags audit_krule field (pflags) to store it that won't interact with
the public one from the API.
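The round trip can be sketched in Python as follows. This is a simplified
model rather than the kernel code: rules are reduced to a single field, the
function names only loosely mirror the kernel's, the value 9 for
AUDIT_LOGINUID and the flag bit are assumptions, and 24 is inferred from
the "f24" in the listing above.

```python
AUDIT_LOGINUID = 9             # assumed value of the legacy field type
AUDIT_LOGINUID_SET = 24        # matches the "f24" shown in the listing
AUDIT_UID_UNSET = 0xFFFFFFFF   # auid=-1 as an unsigned 32-bit value
AUDIT_LOGINUID_LEGACY = 0x1    # hypothetical private flag bit


def data_to_krule(field, value):
    # Incoming rule: rewrite the legacy form, but tag it in the private flags.
    pflags = 0
    if field == AUDIT_LOGINUID and value == AUDIT_UID_UNSET:
        field, value = AUDIT_LOGINUID_SET, 0
        pflags |= AUDIT_LOGINUID_LEGACY
    return {"field": field, "value": value, "pflags": pflags}


def krule_to_data(krule):
    # Listing: report a tagged rule the same way it was originally set.
    if (krule["field"] == AUDIT_LOGINUID_SET
            and krule["pflags"] & AUDIT_LOGINUID_LEGACY
            and krule["value"] == 0):
        return AUDIT_LOGINUID, AUDIT_UID_UNSET
    return krule["field"], krule["value"]
```

A rule that was sent as AUDIT_LOGINUID_SET in the first place carries no
tag, so it is listed unchanged.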
Cc: stable@vger.kernel.org # v3.10-rc1+
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
When allocating space for load_balance_mask, in sched_init, when
CPUMASK_OFFSTACK is set, we've managed to spill over
KMALLOC_MAX_SIZE on our 6144 core machine. The patch below
breaks up the allocations so that they don't overflow the max
alloc size. It also allocates the masks on the node from
which they'll most commonly be accessed, to minimize remote
accesses on NUMA machines.
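A rough back-of-the-envelope check of the overflow, in Python. The numbers
rest on assumptions that are not stated in this message: 64-bit longs, one
cpumask per possible CPU carved out of a single allocation, and a
KMALLOC_MAX_SIZE of 4 MiB (the real limit depends on the kernel
configuration).

```python
BITS_PER_LONG = 64
KMALLOC_MAX_SIZE = 4 * 1024 * 1024   # assumed limit, config dependent


def cpumask_bytes(nr_cpus):
    # One bit per CPU, rounded up to a whole number of longs.
    longs = (nr_cpus + BITS_PER_LONG - 1) // BITS_PER_LONG
    return longs * (BITS_PER_LONG // 8)


nr_cpus = 6144
one_mask = cpumask_bytes(nr_cpus)    # bytes for a single cpumask
all_masks = nr_cpus * one_mask       # bytes if all masks share one allocation
```

Under these assumptions a single mask is well below the limit, while one
mask per CPU in a single block exceeds it, which is why allocating each
mask separately avoids the problem.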
Suggested-by: George Beshers <gbeshers@sgi.com>
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: George Beshers <gbeshers@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Taking the global mutex "trace_types_lock" in the trace_pipe files
causes a bottleneck, as most of the pipe files can be read per CPU
and there's no reason to serialize them.
The current_trace variable was given a ref count, and it cannot
change while the ref count is nonzero. Opening the trace_pipe
files increments the ref count (and closing them decrements it), so
the lock no longer needs to be taken when accessing the
current_trace variable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I rebased Kees' 'param: do not set store func without write perm'
on top of my 'params: cleanup sysfs allocation'. However, my patch
uses krealloc which doesn't zero memory, leaving .store unset.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When one of the trace pipe files is being read (by either the trace_pipe
or trace_pipe_raw), do not allow the current_trace to change. Adding
a ref count that is incremented when the pipe files are opened
prevents the current_trace from being changed.
This will allow for the removal of the global trace_types_lock from
reading the pipe buffers (which is currently a bottleneck).
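The idea can be sketched in a few lines of Python. The TraceArray class and
its method names are hypothetical, and the sketch is single-threaded: it
assumes that opening, closing, and switching tracers are still serialized
by some lock, and only shows why readers can then use current_trace
without one.

```python
class TraceArray:
    """Hypothetical model of a trace instance with a pinned current tracer."""
    def __init__(self, tracer="nop"):
        self.current_trace = tracer
        self.ref = 0                 # number of open pipe files

    def open_pipe(self):
        self.ref += 1                # pin current_trace for this reader
        return self.current_trace

    def close_pipe(self):
        self.ref -= 1

    def set_tracer(self, name):
        if self.ref:
            return False             # busy: a pipe reader still depends on it
        self.current_trace = name
        return True
```

While at least one pipe file is open, set_tracer() refuses the change, so a
reader never sees current_trace switch underneath it.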
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use the FTRACE_OPS_FL_IPMODIFY flag to prevent conflicts with other
ftrace users who also modify regs->ip.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
There is a problem with the audit system when multiple audit records
are created for the same path, each with a different path name type.
The root cause of the problem is in __audit_inode() when an exact
match (both the path name and path name type) is not found for a
path name record; the existing code creates a new path name record,
but it never sets the path name in this record, leaving it NULL.
This patch corrects this problem by assigning the path name to these
newly created records.
There are many ways to reproduce this problem, but one of the
easiest is the following (assuming auditd is running):
# mkdir /root/tmp/test
# touch /root/tmp/test/567
# auditctl -a always,exit -F dir=/root/tmp/test
# touch /root/tmp/test/567
Afterwards, or while the commands above are running, check the audit
log and pay special attention to the PATH records. A faulty kernel
will display something like the following for the file creation:
type=SYSCALL msg=audit(1416957442.025:93): arch=c000003e syscall=2
success=yes exit=3 ... comm="touch" exe="/usr/bin/touch"
type=CWD msg=audit(1416957442.025:93): cwd="/root/tmp"
type=PATH msg=audit(1416957442.025:93): item=0 name="test/"
inode=401409 ... nametype=PARENT
type=PATH msg=audit(1416957442.025:93): item=1 name=(null)
inode=393804 ... nametype=NORMAL
type=PATH msg=audit(1416957442.025:93): item=2 name=(null)
inode=393804 ... nametype=NORMAL
While a patched kernel will show the following:
type=SYSCALL msg=audit(1416955786.566:89): arch=c000003e syscall=2
success=yes exit=3 ... comm="touch" exe="/usr/bin/touch"
type=CWD msg=audit(1416955786.566:89): cwd="/root/tmp"
type=PATH msg=audit(1416955786.566:89): item=0 name="test/"
inode=401409 ... nametype=PARENT
type=PATH msg=audit(1416955786.566:89): item=1 name="test/567"
inode=393804 ... nametype=NORMAL
This issue was brought up by a number of people, but special credit
should go to hujianyang@huawei.com for reporting the problem along
with an explanation of the problem and a patch. While the original
patch did have some problems (see the archive link below), it did
demonstrate the problem and helped kickstart the fix presented here.
* https://lkml.org/lkml/2014/9/5/66
Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
The implementation of execution flow redirection in the livepatch
ftrace handler depends on the specific architecture. This patch
introduces the klp_arch_set_pc() interface (like kgdb_arch_set_pc())
to change the pt_regs.
Signed-off-by: Li Bin <huawei.libin@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This commit introduces code for the live patching core. It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.
It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.
This first version does not implement any consistency mechanism that
ensures that old and new code do not run together. In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check. However, any function change that can not execute safely with
the old version of the function can _not_ be safely applied in this
version.
[ jkosina@suse.cz: due to the number of contributions that got folded into
this original patch from Seth Jennings, add SUSE's copyright as well, as
discussed via e-mail ]
Signed-off-by: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched. This will provide a clean indication in
bug reports that live patching was used.
Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.
Signed-off-by: Seth Jennings <sjenning@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This removes the last few uses of CONFIG_PM_RUNTIME introduced
recently and makes that config option finally go away.
CONFIG_PM will be available directly from the menu now and
also it will be selected automatically if CONFIG_SUSPEND or
CONFIG_HIBERNATION is set.
Merge tag 'pm-config-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CONFIG_PM_RUNTIME elimination from Rafael Wysocki:
"This removes the last few uses of CONFIG_PM_RUNTIME introduced
recently and makes that config option finally go away.
CONFIG_PM will be available directly from the menu now and also it
will be selected automatically if CONFIG_SUSPEND or CONFIG_HIBERNATION
is set"
* tag 'pm-config-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: Eliminate CONFIG_PM_RUNTIME
tty: 8250_omap: Replace CONFIG_PM_RUNTIME with CONFIG_PM
sound: sst-haswell-pcm: Replace CONFIG_PM_RUNTIME with CONFIG_PM
spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
audit_log_end(), which can come from any context (i.e. even a context that
must not sleep), GFP_KERNEL can't be used. Since the audit_buffer knows what
context it should use, pass that down and use that.
See: https://lkml.org/lkml/2014/12/16/542
BUG: sleeping function called from invalid context at mm/slab.c:2849
in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
2 locks held by sulogin/885:
#0: (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff91152e30>] prepare_bprm_creds+0x28/0x8b
#1: (tty_files_lock){+.+.+.}, at: [<ffffffff9123e787>] selinux_bprm_committing_creds+0x55/0x22b
CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
Call Trace:
[<ffffffff916ba529>] dump_stack+0x50/0xa8
[<ffffffff91063185>] ___might_sleep+0x1b6/0x1be
[<ffffffff910632a6>] __might_sleep+0x119/0x128
[<ffffffff91140720>] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
[<ffffffff91141d81>] kmem_cache_alloc+0x43/0x1c9
[<ffffffff914e148d>] __alloc_skb+0x42/0x1a3
[<ffffffff914e2b62>] skb_copy+0x3e/0xa3
[<ffffffff910c263e>] audit_log_end+0x83/0x100
[<ffffffff9123b8d3>] ? avc_audit_pre_callback+0x103/0x103
[<ffffffff91252a73>] common_lsm_audit+0x441/0x450
[<ffffffff9123c163>] slow_avc_audit+0x63/0x67
[<ffffffff9123c42c>] avc_has_perm+0xca/0xe3
[<ffffffff9123dc2d>] inode_has_perm+0x5a/0x65
[<ffffffff9123e7ca>] selinux_bprm_committing_creds+0x98/0x22b
[<ffffffff91239e64>] security_bprm_committing_creds+0xe/0x10
[<ffffffff911515e6>] install_exec_creds+0xe/0x79
[<ffffffff911974cf>] load_elf_binary+0xe36/0x10d7
[<ffffffff9115198e>] search_binary_handler+0x81/0x18c
[<ffffffff91153376>] do_execveat_common.isra.31+0x4e3/0x7b7
[<ffffffff91153669>] do_execve+0x1f/0x21
[<ffffffff91153967>] SyS_execve+0x25/0x29
[<ffffffff916c61a9>] stub_execve+0x69/0xa0
Cc: stable@vger.kernel.org #v3.16-rc1
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Commit f1dc4867 ("audit: anchor all pid references in the initial pid
namespace") introduced a find_vpid() call when adding/removing audit
rules with PID/PPID filters; unfortunately this is problematic as
find_vpid() only works if there is a task with the associated PID
alive on the system. The following commands demonstrate a simple
reproducer.
# auditctl -D
# auditctl -l
# autrace /bin/true
# auditctl -l
This patch resolves the problem by simply using the PID provided by
the user without any additional validation, e.g. no calls to check to
see if the task/PID exists.
Cc: stable@vger.kernel.org # 3.15
Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Having switched over all of the users of CONFIG_PM_RUNTIME to use
CONFIG_PM directly, turn the latter into a user-selectable option
and drop the former entirely from the tree.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org>
Pull NOHZ update from Thomas Gleixner:
"Remove the call into the nohz idle code from the fake 'idle' thread in
the powerclamp driver along with the export of those functions which
was smuggled in via the thermal tree. People have tried to hack
around it in the nohz core code, but it just violates all rightful
assumptions of that code about the only valid calling context (i.e.
the proper idle task).
The powerclamp trainwreck will still work, it just won't get the
benefit of long idle sleeps"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/powerclamp: Remove tick_nohz_idle abuse
Pull irq core fix from Thomas Gleixner:
"A single fix plugging a long standing race between proc/stat and
proc/interrupts access and freeing of interrupt descriptors"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Prevent proc race against freeing of irq descriptors
Pull perf fixes and cleanups from Ingo Molnar:
"A kernel fix plus mostly tooling fixes, but also some tooling
restructuring and cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
perf: Fix building warning on ARM 32
perf symbols: Fix use after free in filename__read_build_id
perf evlist: Use roundup_pow_of_two
tools: Adopt roundup_pow_of_two
perf tools: Make the mmap length autotuning more robust
tools: Adopt rounddown_pow_of_two and deps
tools: Adopt fls_long and deps
tools: Move bitops.h from tools/perf/util to tools/
tools: Introduce asm-generic/bitops.h
tools lib: Move asm-generic/bitops/find.h code to tools/include and tools/lib
tools: Whitespace prep patches for moving bitops.h
tools: Move code originally from asm-generic/atomic.h into tools/include/asm-generic/
tools: Move code originally from linux/log2.h to tools/include/linux/
tools: Move __ffs implementation to tools/include/asm-generic/bitops/__ffs.h
perf evlist: Do not use hard coded value for a mmap_pages default
perf trace: Let the perf_evlist__mmap autosize the number of pages to use
perf evlist: Improve the strerror_mmap method
perf evlist: Clarify sterror_mmap variable names
perf evlist: Fixup brown paper bag on "hint" for --mmap-pages cmdline arg
perf trace: Provide a better explanation when mmap fails
...
commit 4dbd27711c "tick: export nohz tick idle symbols for module
use" was merged via the thermal tree without an explicit ack from the
relevant maintainers.
The exports are abused by the intel powerclamp driver which implements
a fake idle state from a sched FIFO task. This causes all kinds of
wreckage in the NOHZ core code which rightfully assumes that
tick_nohz_idle_enter/exit() are only called from the idle task itself.
Recent changes in the NOHZ core lead to a failure of the powerclamp
driver and now people try to hack completely broken and backwards
workarounds into the NOHZ core code. This is completely unacceptable
and just papers over the real problem. There are way more subtle
issues lurking around the corner.
The real solution is to fix the powerclamp driver by rewriting it with
a sane concept, but that's beyond the scope of this.
So the only solution for now is to remove the calls into the core NOHZ
code from the powerclamp trainwreck along with the exports.
Fixes: d6d71ee4a1 "PM: Introduce Intel PowerClamp Driver"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
Cc: LKP <lkp@01.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
removal. This is possible by using a simple atomic_t for the counter,
rather than our fancy per-cpu counter: it turns out that no one is doing
a module increment per net packet, so the slowdown should be in the noise.
Also, script fixed for new git version.
Cheers,
Rusty.
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"The exciting thing here is the getting rid of stop_machine on module
removal. This is possible by using a simple atomic_t for the counter,
rather than our fancy per-cpu counter: it turns out that no one is
doing a module increment per net packet, so the slowdown should be in
the noise"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
param: do not set store func without write perm
params: cleanup sysfs allocation
kernel:module Fix coding style errors and warnings.
module: Remove stop_machine from module unloading
module: Replace module_ref with atomic_t refcnt
lib/bug: Use RCU list ops for module_bug_list
module: Unlink module with RCU synchronizing instead of stop_machine
module: Wait for RCU synchronizing before releasing a module
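The refcounting idea described in the pull message above (a single atomic counter instead of a per-cpu one) can be illustrated in user space. This is only a minimal sketch using C11 atomics, not the kernel's implementation: struct mod, mod_try_get(), mod_put() and mod_try_unload() are hypothetical names, and the convention that the count starts at 1 and that 0 means "going away" is an assumption made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a module object; not the kernel's struct module. */
struct mod {
    atomic_int refcnt; /* assumed convention: 1 = base reference, 0 = going away */
};

/* Take a reference unless the count has already dropped to zero. */
static bool mod_try_get(struct mod *m)
{
    assert(m != NULL);
    int old = atomic_load(&m->refcnt);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&m->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference previously taken with mod_try_get(). */
static void mod_put(struct mod *m)
{
    atomic_fetch_sub(&m->refcnt, 1);
}

/* Unloading succeeds only if just the base reference is left (1 -> 0). */
static bool mod_try_unload(struct mod *m)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&m->refcnt, &expected, 0);
}
```

Because the unload path is one compare-and-exchange on the same counter that users increment, no global stop is needed to get a consistent view of the count; the price is a shared cache line on every get/put, which the message above argues is in the noise.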
- Fix a regression in leds-gpio introduced by a recent commit that
inadvertently changed the name of one of the properties used by
the driver (Fabio Estevam).
- Fix a regression in the ACPI backlight driver introduced by a
recent fix that missed one special case that had to be taken
into account (Aaron Lu).
- Drop the level of some new kernel messages from the ACPI core
introduced by a recent commit to KERN_DEBUG which they should
have used from the start and drop some other useless KERN_ERR
messages printed by ACPI (Rafael J Wysocki).
- Revert an incorrect commit modifying the cpupower tool
(Prarit Bhargava).
- Fix two regressions introduced by recent commits in the OPP
library and clean up some existing minor issues in that code
(Viresh Kumar).
- Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout
the tree (or drop it where that can be done) in order to make
it possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki,
Ulf Hansson, Ludovic Desroches). There will be one more
"CONFIG_PM_RUNTIME removal" batch after this one, because some
new uses of it have been introduced during the current merge
window, but that should be sufficient to finally get rid of it.
- Make the ACPI EC driver more robust against race conditions
related to GPE handler installation failures (Lv Zheng).
- Prevent the ACPI device PM core code from attempting to
disable GPEs that it has not enabled which confuses ACPICA
and makes it report errors unnecessarily (Rafael J Wysocki).
- Add a "force" command line switch to the intel_pstate driver
to make it possible to override the blacklisting of some
systems in that driver if needed (Ethan Zhao).
- Improve intel_pstate code documentation and add a MAINTAINERS
entry for it (Kristen Carlson Accardi).
- Make the ACPI fan driver create cooling device interfaces
with names that reflect the IDs of the ACPI device objects
they are associated with, except for "generic" ACPI fans
(PNP ID "PNP0C0B"). That's necessary for user space thermal
management tools to be able to connect the fans with the
parts of the system they are supposed to be cooling properly.
From Srinivas Pandruvada.
Merge tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
"These are regression fixes (leds-gpio, ACPI backlight driver,
operating performance points library, ACPI device enumeration
messages, cpupower tool), other bug fixes (ACPI EC driver, ACPI device
PM), some cleanups in the operating performance points (OPP)
framework, continuation of CONFIG_PM_RUNTIME elimination, a couple of
minor intel_pstate driver changes, a new MAINTAINERS entry for it and
an ACPI fan driver change needed for better support of thermal
management in user space.
Specifics:
- Fix a regression in leds-gpio introduced by a recent commit that
inadvertently changed the name of one of the properties used by the
driver (Fabio Estevam).
- Fix a regression in the ACPI backlight driver introduced by a
recent fix that missed one special case that had to be taken into
account (Aaron Lu).
- Drop the level of some new kernel messages from the ACPI core
introduced by a recent commit to KERN_DEBUG which they should have
used from the start and drop some other useless KERN_ERR messages
printed by ACPI (Rafael J Wysocki).
- Revert an incorrect commit modifying the cpupower tool (Prarit
Bhargava).
- Fix two regressions introduced by recent commits in the OPP library
and clean up some existing minor issues in that code (Viresh
Kumar).
- Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout the
tree (or drop it where that can be done) in order to make it
possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki, Ulf
Hansson, Ludovic Desroches).
There will be one more "CONFIG_PM_RUNTIME removal" batch after this
one, because some new uses of it have been introduced during the
current merge window, but that should be sufficient to finally get
rid of it.
- Make the ACPI EC driver more robust against race conditions related
to GPE handler installation failures (Lv Zheng).
- Prevent the ACPI device PM core code from attempting to disable
GPEs that it has not enabled which confuses ACPICA and makes it
report errors unnecessarily (Rafael J Wysocki).
- Add a "force" command line switch to the intel_pstate driver to
make it possible to override the blacklisting of some systems in
that driver if needed (Ethan Zhao).
- Improve intel_pstate code documentation and add a MAINTAINERS entry
for it (Kristen Carlson Accardi).
- Make the ACPI fan driver create cooling device interfaces with
names that reflect the IDs of the ACPI device objects they are
associated with, except for "generic" ACPI fans (PNP ID "PNP0C0B").
That's necessary for user space thermal management tools to be able
to connect the fans with the parts of the system they are supposed
to be cooling properly. From Srinivas Pandruvada"
* tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
MAINTAINERS: add entry for intel_pstate
ACPI / video: update the skip case for acpi_video_device_in_dod()
power / PM: Eliminate CONFIG_PM_RUNTIME
NFC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
SCSI / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
ACPI / EC: Fix unexpected ec_remove_handlers() invocations
Revert "tools: cpupower: fix return checks for sysfs_get_idlestate_count()"
tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
x86 / PM: Replace CONFIG_PM_RUNTIME in io_apic.c
PM: Remove the SET_PM_RUNTIME_PM_OPS() macro
mmc: atmel-mci: use SET_RUNTIME_PM_OPS() macro
PM / Kconfig: Replace PM_RUNTIME with PM in dependencies
ARM / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
sound / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
phy / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
video / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
tty / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
ACPI / PM: Do not disable wakeup GPEs that have not been enabled
ACPI / utils: Drop error messages from acpi_evaluate_reference()
...
When a module_param is defined without DAC write permissions, it can
still be changed at runtime and updated. Drivers using a 0444 permission
may be surprised that these values can still be changed.
For drivers that want to allow updates, any S_IW* flag will set the
"store" function as before. Drivers without S_IW* flags will have the
"store" function unset, enforcing a read-only value. Drivers that wish
neither "store" nor "get" can continue to use "0" for perms to stay out
of sysfs entirely.
Old behavior:
# cd /sys/module/snd/parameters
# ls -l
total 0
-r--r--r-- 1 root root 4096 Dec 11 13:55 cards_limit
-r--r--r-- 1 root root 4096 Dec 11 13:55 major
-r--r--r-- 1 root root 4096 Dec 11 13:55 slots
# cat major
116
# echo -1 > major
-bash: major: Permission denied
# chmod u+w major
# echo -1 > major
# cat major
-1
New behavior:
...
# chmod u+w major
# echo -1 > major
-bash: echo: write error: Input/output error
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull user namespace related fixes from Eric Biederman:
"As these are bug fixes almost all of these changes are marked for
backporting to stable.
The first change (implicitly adding MNT_NODEV on remount) addresses a
regression that was created when security issues with unprivileged
remount were closed. I go on to update the remount test to make it
easy to detect if this issue reoccurs.
Then there are a handful of mount and umount related fixes.
Then half of the changes deal with a recently discovered design
bug in the permission checks of gid_map. Unix since the beginning has
allowed setting group permissions on files to less than the user and
other permissions (aka ---rwx---rwx). As the unix permission checks
stop as soon as a group matches, and setgroups allows setting groups
that can not later be dropped, this results in a situation where it is
possible to legitimately use a group to assign fewer privileges to a
process. Which means dropping a group can increase a process's
privileges.
The fix I have adopted is that gid_map is now no longer writable
without privilege unless the new file /proc/self/setgroups has been
set to permanently disable setgroups.
The bulk of applications using user namespaces, even those using
user namespaces without privilege, remain unaffected by this change.
Unfortunately this fix breaks a couple of user space applications
that were relying on the problematic behavior (one of which was
tools/selftests/mount/unprivileged-remount-test.c).
To hopefully prevent needing a regression fix on top of my security
fix I rounded up folks who work with the container implementations
most likely to be affected and encouraged them to test the changes.
> So far nothing broke on my libvirt-lxc test bed. :-)
> Tested with openSUSE 13.2 and libvirt 1.2.9.
> Tested-by: Richard Weinberger <richard@nod.at>
> Tested on Fedora20 with libvirt 1.2.11, works fine.
> Tested-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
> Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
> Just to be sure I was testing the right thing I also tested using
> my unprivileged nsexec testcases, and they failed on setgroup/setgid
> as now expected, and succeeded there without your patches.
> Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> I tested this with Sandstorm. It breaks as is and it works if I add
> the setgroups thing.
> Tested-by: Andy Lutomirski <luto@amacapital.net> # breaks things as designed :("
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Unbreak the unprivileged remount tests
userns; Correct the comment in map_write
userns: Allow setting gid_maps without privilege when setgroups is disabled
userns: Add a knob to disable setgroups on a per user namespace basis
userns: Rename id_map_mutex to userns_state_mutex
userns: Only allow the creator of the userns unprivileged mappings
userns: Check euid no fsuid when establishing an unprivileged uid mapping
userns: Don't allow unprivileged creation of gid mappings
userns: Don't allow setgroups until a gid mapping has been setablished
userns: Document what the invariant required for safe unprivileged mappings.
groups: Consolidate the setgroups permission checks
mnt: Clear mnt_expire during pivot_root
mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
umount: Do not allow unmounting rootfs.
umount: Disallow unprivileged mount force
mnt: Update unprivileged remount test
mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
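The gid_map problem described in the pull message above comes from how classic Unix permission checks pick exactly one class (owner, group or other) and stop there. A minimal user-space sketch of that rule follows; struct file_meta, in_groups() and may_access() are hypothetical names for illustration, and the model ignores capabilities, ACLs and everything else the real kernel checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical file metadata: owner, owning group, 9-bit mode. */
struct file_meta {
    unsigned uid;
    unsigned gid;
    unsigned mode;
};

static bool in_groups(unsigned gid, const unsigned *groups, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; i++)
        if (groups[i] == gid)
            return true;
    return false;
}

/*
 * 'want' is a 3-bit mask (read = 4, write = 2, execute = 1).
 * The first matching class decides; later classes are never consulted.
 */
static bool may_access(const struct file_meta *f, unsigned uid,
                       const unsigned *groups, size_t ngroups, unsigned want)
{
    assert(f != NULL);
    unsigned bits;

    if (uid == f->uid)
        bits = (f->mode >> 6) & 7;
    else if (in_groups(f->gid, groups, ngroups))
        bits = (f->mode >> 3) & 7; /* group match: "other" bits are ignored */
    else
        bits = f->mode & 7;
    return (bits & want) == want;
}
```

With a file of mode 0707 owned by group 100, a non-owner process that is a member of group 100 is denied read access, while the same process with an empty group list falls through to the "other" bits and is allowed. That is the sense in which dropping a group can increase privileges.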
Pull vfs pile #2 from Al Viro:
"Next pile (and there'll be one or two more).
The large piece in this one is getting rid of /proc/*/ns/* weirdness;
among other things, it (finally) allows making nameidata completely
opaque outside of fs/namei.c, making for easier further cleanups in
there"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_venus_readdir(): use file_inode()
fs/namei.c: fold link_path_walk() call into path_init()
path_init(): don't bother with LOOKUP_PARENT in argument
fs/namei.c: new helper (path_cleanup())
path_init(): store the "base" pointer to file in nameidata itself
make default ->i_fop have ->open() fail with ENXIO
make nameidata completely opaque outside of fs/namei.c
kill proc_ns completely
take the targets of /proc/*/ns/* symlinks to separate fs
bury struct proc_ns in fs/proc
copy address of proc_ns_ops into ns_common
new helpers: ns_alloc_inum/ns_free_inum
make proc_ns_operations work with struct ns_common * instead of void *
switch the rest of proc_ns_operations to working with &...->ns
netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
common object embedded into various struct ....ns
as I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow tracepoints to be enabled right after mm_init(). By passing
in the trace_event= kernel command line parameter, tracepoints can be
enabled at boot up. For debugging things like the initialization of
interrupts, it is needed to have tracepoints enabled very early. People
have asked about this before and this has been on my todo list. As it
can be helpful for Thomas to debug his upcoming 3.20 IRQ work, I'm
pushing this now. This way he can add tracepoints into the IRQ set up
and have users enable them when things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered. If the irq code locks up or reboots the box, having the
tracepoint output go into the kernel ring buffer is useless for
debugging. But being able to add the tp_printk kernel command line
option along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas tried
them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org
Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"As the merge window is still open, and this code was not as complex as
I thought it might be, I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow tracepoints to be enabled right after mm_init().
By passing in the trace_event= kernel command line parameter,
tracepoints can be enabled at boot up. For debugging things like
the initialization of interrupts, it is needed to have tracepoints
enabled very early. People have asked about this before and this
has been on my todo list. As it can be helpful for Thomas to debug
his upcoming 3.20 IRQ work, I'm pushing this now. This way he can
add tracepoints into the IRQ set up and have users enable them when
things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered.
If the irq code locks up or reboots the box, having the tracepoint
output go into the kernel ring buffer is useless for debugging.
But being able to add the tp_printk kernel command line option
along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas
tried them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"
* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add tp_printk cmdline to have tracepoints go to printk()
tracing: Move enabling tracepoints to just after rcu_init()
Pull drm updates from Dave Airlie:
"Highlights:
- AMD KFD driver merge
This is the AMD HSA interface for exposing a lowlevel interface for
GPGPU use. They have an open source userspace built on top of this
interface, and the code looks as good as it was going to get out of
tree.
- Initial atomic modesetting work
The need for an atomic modesetting interface to allow userspace to
try and send a complete set of modesetting state to the driver has
arisen, and been suffering from neglect this past year. No more,
the start of the common code and changes for msm driver to use it
are in this tree. Ongoing work to get the userspace ioctl finished
and the code clean will probably wait until next kernel.
- DisplayID 1.3 and tiled monitor exposed to userspace.
Tiled monitor property is now exposed for userspace to make use of.
- Rockchip drm driver merged.
- imx gpu driver moved out of staging
Other stuff:
- core:
panel - MIPI DSI + new panels.
expose suggested x/y properties for virtual GPUs
- i915:
Initial Skylake (SKL) support
gen3/4 reset work
start of dri1/ums removal
infoframe tracking
fixes for lots of things.
- nouveau:
tegra k1 voltage support
GM204 modesetting support
GT21x memory reclocking work
- radeon:
CI dpm fixes
GPUVM improvements
Initial DPM fan control
- rcar-du:
HDMI support added
removed some support for old boards
slave encoder driver for Analog Devices adv7511
- exynos:
Exynos4415 SoC support
- msm:
a4xx gpu support
atomic helper conversion
- tegra:
iommu support
universal plane support
ganged-mode DSI support
- sti:
HDMI i2c improvements
- vmwgfx:
some late fixes.
- qxl:
use suggested x/y properties"
* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
drm: sti: fix module compilation issue
drm/i915: save/restore GMBUS freq across suspend/resume on gen4
drm: sti: correctly cleanup CRTC and planes
drm: sti: add HQVDP plane
drm: sti: add cursor plane
drm: sti: enable auxiliary CRTC
drm: sti: fix delay in VTG programming
drm: sti: prepare sti_tvout to support auxiliary crtc
drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
drm: sti: fix hdmi avi infoframe
drm: sti: remove event lock while disabling vblank
drm: sti: simplify gdp code
drm: sti: clear all mixer control
drm: sti: remove gpio for HDMI hot plug detection
drm: sti: allow to change hdmi ddc i2c adapter
drm/doc: Document drm_add_modes_noedid() usage
drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
drm: Zero out DRM object memory upon cleanup
drm/i915/bdw: Fix the write setting up the WIZ hashing mode
...
Add the kernel command line tp_printk option that will have tracepoints
that are active sent to printk() as well as to the trace buffer.
Passing "tp_printk" will activate this. To turn it off, the sysctl
/proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note,
this only works if the cmdline option is used. Echoing 1 into the sysctl
file without the cmdline option will have no effect.
Note, this is a dangerous option. Having high frequency tracepoints send
their data to printk() can possibly cause a live lock. This is another
reason why this is only active if the command line option is used.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
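The interaction between the tp_printk command line option and the tracepoint_printk sysctl described above can be modelled as a two-level gate. The sketch below is a user-space illustration under that assumption, not the tracing code itself: struct tp_state and the tp_*() helpers are hypothetical names, and the command line parsing is deliberately simplistic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state; not the kernel's variables. */
struct tp_state {
    bool boot_enabled; /* the bare word "tp_printk" was on the command line */
    int sysctl_value;  /* models /proc/sys/kernel/tracepoint_printk */
};

/* Look for "tp_printk" as a whole word in a space-separated command line. */
static void tp_parse_cmdline(struct tp_state *s, const char *cmdline)
{
    assert(s != NULL && cmdline != NULL);
    const char *p = cmdline;
    size_t len = strlen("tp_printk");

    s->boot_enabled = false;
    s->sysctl_value = 0;
    while ((p = strstr(p, "tp_printk")) != NULL) {
        bool starts = (p == cmdline) || (p[-1] == ' ');
        bool ends = (p[len] == '\0') || (p[len] == ' ');

        if (starts && ends) {
            s->boot_enabled = true;
            s->sysctl_value = 1;
            return;
        }
        p += len;
    }
}

/* Echoing 0 or 1 into the sysctl only records the value. */
static void tp_sysctl_write(struct tp_state *s, int v)
{
    s->sysctl_value = (v != 0);
}

/* Output goes to printk() only if both the boot option and the sysctl agree. */
static bool tp_should_printk(const struct tp_state *s)
{
    return s->boot_enabled && s->sysctl_value != 0;
}
```

In this model the sysctl can switch the output off (and back on) at run time, but it can never enable it on a system that was booted without the option, which matches the safety argument given in the commit message.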
Enabling tracepoints at boot up can be very useful. The tracepoint
can be initialized right after RCU has been. There's no need to
wait for the early_initcall() to be called. That's too late for some
things that can use tracepoints for debugging. Move the logic to
enable tracepoints out of the initcalls and into init/main.c to
right after rcu_init().
This also allows trace_printk() to be used early too.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.org
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Here's the big tty/serial driver update for 3.19-rc1.
There are a number of TTY core changes/fixes in here from Peter Hurley
that have all been tested in linux-next for a long time now. There are
also the normal serial driver updates as well, full details in the
changelog below.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here's the big tty/serial driver update for 3.19-rc1.
There are a number of TTY core changes/fixes in here from Peter Hurley
that have all been tested in linux-next for a long time now. There are
also the normal serial driver updates as well, full details in the
changelog below"
* tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
serial: pxa: hold port.lock when reporting modem line changes
tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
tty: Deletion of unnecessary checks before two function calls
n_tty: Fix read_buf race condition, increment read_head after pushing data
serial: of-serial: add PM suspend/resume support
Revert "serial: of-serial: add PM suspend/resume support"
Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
serial: 8250: don't attempt a trylock if in sysrq
serial: core: Add big-endian iotype
serial: samsung: use port->fifosize instead of hardcoded values
serial: samsung: prefer to use fifosize from driver data
serial: samsung: fix style problems
serial: samsung: wait for transfer completion before clock disable
serial: icom: fix error return code
serial: tegra: clean up tty-flag assignments
serial: Fix io address assign flow with Fintek PCI-to-UART Product
serial: mxs-auart: fix tx_empty against shift register
serial: mxs-auart: fix gpio change detection on interrupt
serial: mxs-auart: Fix mxs_auart_set_ldisc()
serial: 8250_dw: Use 64-bit access for OCTEON.
...
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing bugs in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
o Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
o The seq_buf code was changed to model the seq_file code. I have
a patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was done
to make sure that seq_buf.c is compatible with seq_file.c. I may
try to get that patch in for 3.20.
o The seq_buf.c file was moved to lib/ to remove it from being dependent
on CONFIG_TRACING.
o The printk() was updated to allow for a per_cpu "override" of
the internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow the
NMI to change what printk() does in order to call dump_stack() without
needing to update that code as well.
o Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would wait
till all the NMIs finished, and then it would print the seq_buf
data to the console safely from a non NMI context.
[ Updated to remove unnecessary preempt_disable in printk() ]
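The descriptor idea in the list above can be illustrated in userspace. This is only a sketch: struct sbuf, sbuf_init() and sbuf_printf() are hypothetical stand-ins, not the kernel's seq_buf API, and the overflow policy (drop the partial write and latch a flag) is an assumption made for the example.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a seq_buf-style descriptor. */
struct sbuf {
	char *buffer;     /* caller-provided storage */
	size_t size;      /* total bytes in buffer, must be > 0 */
	size_t len;       /* bytes used so far, excluding the NUL */
	int overflowed;   /* latched once a write did not fit */
};

void sbuf_init(struct sbuf *s, char *buf, size_t size)
{
	s->buffer = buf;
	s->size = size;
	s->len = 0;
	s->overflowed = 0;
	s->buffer[0] = '\0';
}

/* Append printf-style text; returns 0 on success, -1 if it did not fit. */
int sbuf_printf(struct sbuf *s, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	if (s->overflowed)
		return -1;

	room = s->size - s->len;
	va_start(ap, fmt);
	n = vsnprintf(s->buffer + s->len, room, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= room) {
		/* Discard the truncated write so earlier content stays intact. */
		s->buffer[s->len] = '\0';
		s->overflowed = 1;
		return -1;
	}
	s->len += (size_t)n;
	return 0;
}
```

A writer fills the buffer where printing is unsafe, and a later caller in a safe context hands s.buffer to the real output routine.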
Merge tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixlet from Steven Rostedt:
"Remove unnecessary preempt_disable in printk()"
* tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
printk: Do not disable preemption for accessing printk_func
Pull audit updates from Paul Moore:
"Two small patches from the audit next branch; only one of which has
any real significant code changes, the other is simply a MAINTAINERS
update for audit.
The single code patch is pretty small and rather straightforward, it
changes the audit "version" number reported to userspace from an
integer to a bitmap which is used to indicate the functionality of the
running kernel. This really doesn't have much impact on the kernel,
but it will make life easier for the audit userspace folks.
Thankfully we were still on a version number which allowed us to do
this without breaking userspace"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: convert status version to a feature bitmap
audit: add Paul Moore to the MAINTAINERS entry
There's a lot of common code in inode and mount marks handling. Factor it
out to a common helper function.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following the suggestions from Andrew Morton and Stephen Rothwell,
don't expand the ARCH list in kernel/gcov/Kconfig. Instead,
define an ARCH_HAS_GCOV_PROFILE_ALL bool which architectures
can enable.
Set ARCH_HAS_GCOV_PROFILE_ALL on architectures where it was
previously allowed, plus ARM64, which I tested.
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current stacktrace code only has a function for console output. page_owner,
which will be introduced in a following patch, needs to print the output of
stacktrace into a buffer for our own output format, so a new function,
snprint_stack_trace(), is needed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both register and unregister call build_map_info() in order to create the
list of mappings before installing or removing breakpoints for every mm
which maps file backed memory. As such, there is no reason to hold the
i_mmap_rwsem exclusively, so share it and allow concurrent readers to
build the mapping data.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory. To
this end, this lock can also be a rwsem. In addition, there are some
important opportunities to share the lock when there are no tree
modifications.
This conversion is straightforward. For now, all users take the write
lock.
[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0                            CPU1
                                show_interrupts()
                                  desc = irq_to_desc(X);
free_desc(desc)
  remove_from_radix_tree();
  kfree(desc);
                                  raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremely hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so files that are
built conditionally if CONFIG_PM_RUNTIME is set may now be built
if CONFIG_PM is set.
Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile
for this reason.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Pull cgroup update from Tejun Heo:
"cpuset got simplified a bit. cgroup core got a fix on unified
hierarchy and grew some effective css related interfaces which will be
used for blkio support for writeback IO traffic which is currently
being worked on"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: implement cgroup_get_e_css()
cgroup: add cgroup_subsys->css_e_css_changed()
cgroup: add cgroup_subsys->css_released()
cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
cpuset: lock vs unlock typo
cpuset: simplify cpuset_node_allowed API
cpuset: convert callback_mutex to a spinlock
Pull workqueue update from Tejun Heo:
"Work items which may be involved in memory reclaim path may be
executed by the rescuer under memory pressure. When a rescuer gets
activated, it processes whatever is on the pending list and then goes
back to sleep until the manager kicks it again which involves 100ms
delay.
This is problematic for self-requeueing work items or the ones running
on ordered workqueues as there always is only one work item on the
pending list when the rescuer kicks in. The execution of that work
item produces more to execute but the rescuer won't see them until
after the said 100ms has passed, so such workqueues would only execute
one work item every 100ms under prolonged memory pressure, which BTW
may be being prolonged due to the slow execution.
Neil wrote up a patch which fixes this issue by keeping the rescuer
working as long as the target workqueue is busy but doesn't have
enough workers"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: allow rescuer thread to do more work.
workqueue: invert the order between pool->lock and wq_mayday_lock
workqueue: cosmetic update in rescuer_thread()
Pull percpu updates from Tejun Heo:
"Nothing interesting. A patch to convert the remaining __get_cpu_var()
users, another to fix non-critical off-by-one in an assertion and a
cosmetic conversion to lockless_dereference() in percpu-ref.
The back-merge from mainline is to receive lockless_dereference()"
* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Replace smp_read_barrier_depends() with lockless_dereference()
percpu: Convert remaining __get_cpu_var uses in 3.18-rcX
percpu: off by one in BUG_ON()
Here's a batch of i915 fixes for 3.19.
* tag 'drm-intel-next-fixes-2014-12-11' of git://anongit.freedesktop.org/drm-intel:
drm/i915: save/restore GMBUS freq across suspend/resume on gen4
drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
drm/i915/bdw: Fix the write setting up the WIZ hashing mode
drm/i915: Don't complain about stolen conflicts on gen3
drm/i915: resume MST after reading back hw state
drm/i915: Handle inaccurate time conversion issues
drm/i915: compute wait_ioctl timeout correctly
drm/i915: don't always do full mode sets when infoframes are enabled
Pull s390 updates from Martin Schwidefsky:
"The most notable change for this pull request is the ftrace rework
from Heiko. It brings a small performance improvement and the ground
work to support a new gcc option to replace the mcount blocks with a
single nop.
Two new s390 specific system calls are added to emulate user space
mmio for PCI, an artifact of how PCI memory is accessed.
Two patches for the memory management with changes to common code.
For KVM mm_forbids_zeropage is added which disables the empty zero
page for an mm that is used by a KVM process. And an optimization,
pmdp_get_and_clear_full is added, analogous to ptep_get_and_clear_full.
Some micro optimization for the cmpxchg and the spinlock code.
And as usual bug fixes and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
s390/cputime: fix 31-bit compile
s390/scm_block: make the number of reqs per HW req configurable
s390/scm_block: handle multiple requests in one HW request
s390/scm_block: allocate aidaw pages only when necessary
s390/scm_block: use mempool to manage aidaw requests
s390/eadm: change timeout value
s390/mm: fix memory leak of ptlock in pmd_free_tlb
s390: use local symbol names in entry[64].S
s390/ptrace: always include vector registers in core files
s390/simd: clear vector register pointer on fork/clone
s390: translate cputime magic constants to macros
s390/idle: convert open coded idle time seqcount
s390/idle: add missing irq off lockdep annotation
s390/debug: avoid function call for debug_sprintf_*
s390/kprobes: fix instruction copy for out of line execution
s390: remove diag 44 calls from cpu_relax()
s390/dasd: retry partition detection
s390/dasd: fix list corruption for sleep_on requests
s390/dasd: fix infinite term I/O loop
s390/dasd: remove unused code
...
It is important that all maps are less than PAGE_SIZE
or else setting the last byte of the buffer to '0'
could write off the end of the allocated storage.
Correct the misleading comment.
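The size rule can be sketched as a bounds check that runs before the terminator is written. This is an illustration only: MAP_BUF_SIZE stands in for PAGE_SIZE and copy_map_input() is a hypothetical helper, not the kernel function.

```c
#include <stddef.h>
#include <string.h>

#define MAP_BUF_SIZE 4096 /* stand-in for PAGE_SIZE in this sketch */

/*
 * Copy 'count' bytes into a MAP_BUF_SIZE buffer and NUL-terminate it.
 * Input of MAP_BUF_SIZE bytes or more is rejected, because the
 * terminator would land one byte past the end of the buffer.
 */
int copy_map_input(char *dst, const char *src, size_t count)
{
	if (count >= MAP_BUF_SIZE)
		return -1;
	memcpy(dst, src, count);
	dst[count] = '\0';
	return 0;
}
```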
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that setgroups can be disabled and not reenabled, setting gid_map
without privilege can now be enabled when setgroups is disabled.
This restores most of the functionality that was lost when unprivileged
setting of gid_map was removed. Applications that use this functionality
will need to check to see if they use setgroups or init_groups, and if they
don't they can be fixed by simply disabling setgroups before writing to
gid_map.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Expose the knob to user space through a proc file /proc/<pid>/setgroups
A value of "deny" means the setgroups system call is disabled in the
current process's user namespace and can not be enabled in the
future in this user namespace.
A value of "allow" means the setgroups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.
- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.
- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.
This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.
A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Processes with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.
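From userspace, the ordering this implies (write "deny" to the setgroups file, then write gid_map) might look like the sketch below. The helper names are hypothetical, the target process is assumed to already be in a new user namespace, and the "inside outside count" line format is the usual gid_map syntax.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a short string to an existing file; 0 on success, -1 on failure. */
int write_string(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t n;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, value, len);
	close(fd);
	return (n == (ssize_t)len) ? 0 : -1;
}

/* Build one gid_map line: "<gid inside ns> <gid outside ns> <count>". */
int format_gid_map(char *buf, size_t size, unsigned int inside,
		   unsigned int outside, unsigned int count)
{
	int n = snprintf(buf, size, "%u %u %u", inside, outside, count);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: disable setgroups first, then map a single gid. */
int deny_setgroups_then_map_gid(pid_t pid, unsigned int inside,
				unsigned int outside)
{
	char path[64];
	char map[64];

	snprintf(path, sizeof(path), "/proc/%ld/setgroups", (long)pid);
	if (write_string(path, "deny") != 0)
		return -1;
	if (format_gid_map(map, sizeof(map), inside, outside, 1) != 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/%ld/gid_map", (long)pid);
	return write_string(path, map);
}
```

The order matters because writing to the setgroups file is only permitted before gid_map has been set.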
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull networking updates from David Miller:
1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.
This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu
2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.
3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.
4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.
5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.
6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.
7) Add a variant of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.
8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.
9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.
10) Support TSO/LSO in sunvnet driver, from David L Stevens.
11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.
12) Remote checksum offload, from Tom Herbert.
13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.
14) Add MPLS support to openvswitch, from Simon Horman.
15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.
16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.
17) Allow userspace to ask for the CPU upon which a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.
18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.
19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.
20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.
21) Add VLAN packet scheduler action, from Jiri Pirko.
22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.
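For item 17, a query from userspace could look roughly like this. It is a sketch only: incoming_cpu() is a hypothetical wrapper, and the fallback value 49 for SO_INCOMING_CPU is the asm-generic number, assumed here for systems whose headers predate the option (some architectures use a different value).

```c
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49 /* asm-generic value; assumption for old headers */
#endif

/*
 * Hypothetical wrapper: store the CPU recorded for the socket in *cpu.
 * Returns 0 on success, -1 on failure (errno is set by getsockopt()).
 */
int incoming_cpu(int fd, int *cpu)
{
	socklen_t len = sizeof(*cpu);

	return getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, cpu, &len);
}
```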
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...
As printk_func will either be the default function, or a per_cpu function
for the current CPU, there's no reason to disable preemption to access
it from printk. That's because if the printk_func is not the default
then the caller had better have disabled preemption as they were the one to
change it.
Link: http://lkml.kernel.org/r/CA+55aFz5-_LKW4JHEBoWinN9_ouNcGRWAF2FUA35u46FRN-Kxw@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We allow the PMU driver to change the cpu on which the event
should be installed. This happened in patch:
e2d37cd213 ("perf: Allow the PMU driver to choose the CPU on which to install events")
This patch also forces all the group members to follow
the currently opened event's cpu if the group happened
to be moved.
This and the change of event->cpu in perf_install_in_context()
function introduced in:
0cda4c0231 ("perf: Introduce perf_pmu_migrate_context()")
forces group members to change their event->cpu,
if the currently-opened-event's PMU changed the cpu
and there is a group move.
The above behaviour causes problems for breakpoint events,
which use event->cpu to touch cpu specific data for
breakpoints accounting. By changing event->cpu, some
breakpoint slots were wrongly accounted for the given
cpu.
Vince's perf fuzzer hit this issue and caused the following
WARN on my setup:
WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
Can't find any breakpoint slot
[...]
This patch changes the group moving code to keep the event's
original cpu.
Reported-by: Vince Weaver <vince@deater.net>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This time we have some more new material than we used to have during
the last couple of development cycles.
The most important part of it to me is the introduction of a unified
interface for accessing device properties provided by platform
firmware. It works with Device Trees and ACPI in a uniform way and
drivers using it need not worry about where the properties come
from as long as the platform firmware (either DT or ACPI) makes
them available. It covers both devices and "bare" device node
objects without struct device representation as that turns out to
be necessary in some cases. This has been in the works for quite
a few months (and development cycles) and has been approved by
all of the relevant maintainers.
On top of that, some drivers are switched over to the new interface
(at25, leds-gpio, gpio_keys_polled) and some additional changes are
made to the core GPIO subsystem to allow device drivers to manipulate
GPIOs in the "canonical" way on platforms that provide GPIO information
in their ACPI tables, but don't assign names to GPIO lines (in which
case the driver needs to do that on the basis of what it knows about
the device in question). That also has been approved by the GPIO
core maintainers and the rfkill driver is now going to use it.
Second is support for hardware P-states in the intel_pstate driver.
It uses CPUID to detect whether or not the feature is supported by
the processor in which case it will be enabled by default. However,
it can be disabled entirely from the kernel command line if necessary.
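The CPUID check can be tried from userspace as well. The sketch below is not the driver's code: it assumes the documented location of the HWP flag (CPUID leaf 06H, EAX bit 7), uses the compiler-provided <cpuid.h> on x86 only, and the function names are made up for the example.

```c
#include <stdint.h>

#define CPUID_LEAF_THERMAL_POWER 0x06u
#define CPUID_06_EAX_HWP         (1u << 7) /* HWP base feature flag */

/* Decode the HWP flag from a CPUID.06H:EAX value. */
int eax_reports_hwp(uint32_t eax)
{
	return (eax & CPUID_06_EAX_HWP) != 0;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Returns 1 if the running CPU advertises HWP, 0 otherwise. */
int cpu_has_hwp(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 when the leaf is not supported. */
	if (!__get_cpuid(CPUID_LEAF_THERMAL_POWER, &eax, &ebx, &ecx, &edx))
		return 0;
	return eax_reports_hwp(eax);
}
#endif
```

A positive result only says the processor offers the feature; whether the kernel uses it still depends on the command line.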
Next is support for a platform firmware interface based on ACPI
operation regions used by the PMIC (Power Management Integrated
Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
That interface is used for manipulating power resources and for
thermal management: sensor temperature reporting, trip point setting
and so on.
Also the ACPI core is now going to support the _DEP configuration
information in a limited way. Basically, _DEP is supposed to reflect
off-the-hierarchy dependencies between devices which may be very
indirect, like when AML for one device accesses locations in an
operation region handled by another device's driver (usually, the
device depended on this way is a serial bus or GPIO controller).
The support added this time is sufficient to make the ACPI battery
driver work on Asus T100A, but it is general enough to be able to
cover some other use cases in the future.
Finally, we have a new cpufreq driver for the Loongson1B processor.
In addition to the above, there are fixes and cleanups all over the
place as usual and a traditional ACPICA update to a recent upstream
release.
As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver
for Intel platforms should be able to handle power management of
the DMA engine correctly, the cpufreq-dt driver should interact
with the thermal subsystem in a better way and the ACPI backlight
driver should handle some more corner cases, among other things.
On top of the ACPICA update there are fixes for race conditions
in the ACPICA's interrupt handling code which might lead to some
random and strange looking failures on some systems.
In the cleanups department the most visible part is the series
of commits targeted at getting rid of the CONFIG_PM_RUNTIME
configuration option. That was triggered by a discussion
regarding the generic power domains code during which we realized
that trying to support certain combinations of PM config options
was painful and not really worth it, because nobody would use them
in production anyway. For this reason, we decided to make
CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that led to the
conclusion that the latter became redundant and CONFIG_PM could
be used instead of it. The material here makes that replacement
in a major part of the tree, but there will be at least one more
batch of that in the second part of the merge window.
Specifics:
- Support for retrieving device properties information from ACPI
_DSD device configuration objects and a unified device properties
interface for device drivers (and subsystems) on top of that.
As stated above, this works with Device Trees and ACPI and allows
device drivers to be written in a platform firmware (DT or ACPI)
agnostic way. The at25, leds-gpio and gpio_keys_polled drivers
are now going to use this new interface and the GPIO subsystem
is additionally modified to allow device drivers to assign names
to GPIO resources returned by ACPI _CRS objects (in case _DSD is
not present or does not provide the expected data). The changes
in this set are mostly from Mika Westerberg, Rafael J Wysocki,
Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam,
Geert Uytterhoeven).
- Support for Hardware Managed Performance States (HWP) as described
in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
driver. CPUID is used to detect whether or not the feature is
supported by the processor. If supported, it will be enabled
automatically unless the intel_pstate=no_hwp switch is present in
the kernel command line. From Dirk Brandewie.
- New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).
- Support for firmware interface based on ACPI operation regions
used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
platforms for power resource control and thermal management
(Aaron Lu).
- Limited support for retrieving off-the-hierarchy dependencies
between devices from ACPI _DEP device configuration objects
and deferred probing support for the ACPI battery driver based
on the _DEP information to make that driver work on Asus T100A
(Lan Tianyu).
- New cpufreq driver for the Loongson1B processor (Kelvin Cheung).
- ACPICA update to upstream revision 20141107 which only affects
tools (Bob Moore).
- Fixes for race conditions in the ACPICA's interrupt handling
code and in the ACPI code related to system suspend and resume
(Lv Zheng and Rafael J Wysocki).
- ACPI core fix for an RCU-related issue in the ioremap() regions
management code that slowed down significantly after CPUs had
been allowed to enter idle states even if they'd had RCU callbacks
queued and triggered some problems in a certain proprietary graphics
driver (and elsewhere). The fix replaces synchronize_rcu() in
that code with synchronize_rcu_expedited() which makes the issue
go away. From Konstantin Khlebnikov.
- ACPI LPSS (Low-Power Subsystem) driver fix to handle power
management of the DMA engine included into the LPSS correctly.
The problem is that the DMA engine doesn't have ACPI PM support
of its own and it simply is turned off when the last LPSS device
having ACPI PM support goes into D3cold. To work around that,
the PM domain used by the ACPI LPSS driver is redesigned so at
least one device with ACPI PM support will be on as long as the
DMA engine is in use. From Andy Shevchenko.
- ACPI backlight driver fix to avoid using it on "Win8-compatible"
systems where it doesn't work and where it was used by default by
mistake (Aaron Lu).
- Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and
Ashwin Chaugule (mostly related to the upcoming ARM64 support).
- Intel RAPL (Running Average Power Limit) power capping driver
fixes and improvements including new processor IDs (Jacob Pan).
- Generic power domains modification to power up domains after
attaching devices to them to meet the expectations of device
drivers and bus types assuming devices to be accessible at
probe time (Ulf Hansson).
- Preliminary support for controlling device clocks from the
generic power domains core code and modifications of the
ARM/shmobile platform to use that feature (Ulf Hansson).
- Assorted minor fixes and cleanups of the generic power
domains core code (Ulf Hansson, Geert Uytterhoeven).
- Assorted minor fixes and cleanups of the device clocks control
code in the PM core (Geert Uytterhoeven, Grygorii Strashko).
- Consolidation of device power management Kconfig options by making
CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
which is now redundant (Rafael J Wysocki and Kevin Hilman). That
is the first batch of the changes needed for this purpose.
- Core device runtime power management support code cleanup related
to the execution of callbacks (Andrzej Hajda).
- cpuidle ARM support improvements (Lorenzo Pieralisi).
- cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and
a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
Bartlomiej Zolnierkiewicz).
- New cpufreq driver callback (->ready) to be executed when the
cpufreq core is ready to use a given policy object and cpufreq-dt
driver modification to use that callback for cooling device
registration (Viresh Kumar).
- cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu,
James Geboski, Tomeu Vizoso).
- Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
Stefan Wahren, Petr Cvek).
- OPP (Operating Performance Points) framework modification to
allow OPPs to be removed too and update of a few cpufreq drivers
(cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
during initialization) on driver removal (Viresh Kumar).
- Hibernation core fixes and cleanups (Tina Ruchandani and
Markus Elfring).
- PM Kconfig fix related to CPU power management (Pankaj Dubey).
- cpupower tool fix (Prarit Bhargava).
Merge tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"This time we have some more new material than we used to have during
the last couple of development cycles.
The most important part of it to me is the introduction of a unified
interface for accessing device properties provided by platform
firmware. It works with Device Trees and ACPI in a uniform way and
drivers using it need not worry about where the properties come from
as long as the platform firmware (either DT or ACPI) makes them
available. It covers both devices and "bare" device node objects
without struct device representation as that turns out to be necessary
in some cases. This has been in the works for quite a few months (and
development cycles) and has been approved by all of the relevant
maintainers.
On top of that, some drivers are switched over to the new interface
(at25, leds-gpio, gpio_keys_polled) and some additional changes are
made to the core GPIO subsystem to allow device drivers to manipulate
GPIOs in the "canonical" way on platforms that provide GPIO
information in their ACPI tables, but don't assign names to GPIO lines
(in which case the driver needs to do that on the basis of what it
knows about the device in question). That also has been approved by
the GPIO core maintainers and the rfkill driver is now going to use
it.
Second is support for hardware P-states in the intel_pstate driver.
It uses CPUID to detect whether or not the feature is supported by the
processor in which case it will be enabled by default. However, it
can be disabled entirely from the kernel command line if necessary.
Next is support for a platform firmware interface based on ACPI
operation regions used by the PMIC (Power Management Integrated
Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms.
That interface is used for manipulating power resources and for
thermal management: sensor temperature reporting, trip point setting
and so on.
Also the ACPI core is now going to support the _DEP configuration
information in a limited way. Basically, _DEP is supposed to reflect
off-the-hierarchy dependencies between devices which may be very
indirect, like when AML for one device accesses locations in an
operation region handled by another device's driver (usually, the
device depended on this way is a serial bus or GPIO controller). The
support added this time is sufficient to make the ACPI battery driver
work on Asus T100A, but it is general enough to be able to cover some
other use cases in the future.
Finally, we have a new cpufreq driver for the Loongson1B processor.
In addition to the above, there are fixes and cleanups all over the
place as usual and a traditional ACPICA update to a recent upstream
release.
As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for
Intel platforms should be able to handle power management of the DMA
engine correctly, the cpufreq-dt driver should interact with the
thermal subsystem in a better way and the ACPI backlight driver should
handle some more corner cases, among other things.
On top of the ACPICA update there are fixes for race conditions in the
ACPICA's interrupt handling code which might lead to some random and
strange looking failures on some systems.
In the cleanups department the most visible part is the series of
commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration
option. That was triggered by a discussion regarding the generic
power domains code during which we realized that trying to support
certain combinations of PM config options was painful and not really
worth it, because nobody would use them in production anyway. For
this reason, we decided to make CONFIG_PM_SLEEP select
CONFIG_PM_RUNTIME and that led to the conclusion that the latter
became redundant and CONFIG_PM could be used instead of it. The
material here makes that replacement in a major part of the tree, but
there will be at least one more batch of that in the second part of
the merge window.
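As a rough illustration of what that replacement means for a driver, the
sketch below guards runtime PM callbacks with CONFIG_PM where an older
driver would have tested CONFIG_PM_RUNTIME. It is a userspace model only:
the demo_dev structure and the demo_runtime_*() functions are hypothetical
names invented for this example, and CONFIG_PM is defined by hand here
because there is no Kconfig in a standalone file.

```c
#include <assert.h>

/* Assumption: stand-in for the Kconfig symbol the kernel build provides. */
#define CONFIG_PM 1

/* Hypothetical device state, not a kernel structure. */
struct demo_dev {
	int powered;
};

#ifdef CONFIG_PM	/* an older driver would have tested CONFIG_PM_RUNTIME */
static int demo_runtime_suspend(struct demo_dev *d)
{
	d->powered = 0;
	return 0;
}

static int demo_runtime_resume(struct demo_dev *d)
{
	d->powered = 1;
	return 0;
}
#else
/* Without PM support the callbacks collapse into no-ops. */
static int demo_runtime_suspend(struct demo_dev *d)
{
	(void)d;
	return 0;
}

static int demo_runtime_resume(struct demo_dev *d)
{
	(void)d;
	return 0;
}
#endif
```

With a single option left to test, a driver only has to cover the
"PM enabled" and "PM disabled" cases.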
Specifics:
- Support for retrieving device properties information from ACPI _DSD
device configuration objects and a unified device properties
interface for device drivers (and subsystems) on top of that. As
stated above, this works with Device Trees and ACPI and allows
device drivers to be written in a platform firmware (DT or ACPI)
agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are
now going to use this new interface and the GPIO subsystem is
additionally modified to allow device drivers to assign names to
GPIO resources returned by ACPI _CRS objects (in case _DSD is not
present or does not provide the expected data). The changes in
this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron
Lu, and Darren Hart with some fixes from others (Fabio Estevam,
Geert Uytterhoeven).
- Support for Hardware Managed Performance States (HWP) as described
in Volume 3, section 14.4, of the Intel SDM in the intel_pstate
driver. CPUID is used to detect whether or not the feature is
supported by the processor. If supported, it will be enabled
automatically unless the intel_pstate=no_hwp switch is present in
the kernel command line. From Dirk Brandewie.
- New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie).
- Support for firmware interface based on ACPI operation regions used
by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR
platforms for power resource control and thermal management (Aaron
Lu).
- Limited support for retrieving off-the-hierarchy dependencies
between devices from ACPI _DEP device configuration objects and
deferred probing support for the ACPI battery driver based on the
_DEP information to make that driver work on Asus T100A (Lan
Tianyu).
- New cpufreq driver for the Loongson1B processor (Kelvin Cheung).
- ACPICA update to upstream revision 20141107 which only affects
tools (Bob Moore).
- Fixes for race conditions in the ACPICA's interrupt handling code
and in the ACPI code related to system suspend and resume (Lv Zheng
and Rafael J Wysocki).
- ACPI core fix for an RCU-related issue in the ioremap() regions
management code that slowed down significantly after CPUs had been
allowed to enter idle states even if they'd had RCU callbacks
queued and triggered some problems in a certain proprietary graphics
driver (and elsewhere). The fix replaces synchronize_rcu() in that
code with synchronize_rcu_expedited() which makes the issue go
away. From Konstantin Khlebnikov.
- ACPI LPSS (Low-Power Subsystem) driver fix to handle power
management of the DMA engine included into the LPSS correctly. The
problem is that the DMA engine doesn't have ACPI PM support of its
own and it simply is turned off when the last LPSS device having
ACPI PM support goes into D3cold. To work around that, the PM
domain used by the ACPI LPSS driver is redesigned so at least one
device with ACPI PM support will be on as long as the DMA engine is
in use. From Andy Shevchenko.
- ACPI backlight driver fix to avoid using it on "Win8-compatible"
systems where it doesn't work and where it was used by default by
mistake (Aaron Lu).
- Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki,
Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin
Chaugule (mostly related to the upcoming ARM64 support).
- Intel RAPL (Running Average Power Limit) power capping driver fixes
and improvements including new processor IDs (Jacob Pan).
- Generic power domains modification to power up domains after
attaching devices to them to meet the expectations of device
drivers and bus types assuming devices to be accessible at probe
time (Ulf Hansson).
- Preliminary support for controlling device clocks from the generic
power domains core code and modifications of the ARM/shmobile
platform to use that feature (Ulf Hansson).
- Assorted minor fixes and cleanups of the generic power domains core
code (Ulf Hansson, Geert Uytterhoeven).
- Assorted minor fixes and cleanups of the device clocks control code
in the PM core (Geert Uytterhoeven, Grygorii Strashko).
- Consolidation of device power management Kconfig options by making
CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter
which is now redundant (Rafael J Wysocki and Kevin Hilman). That
is the first batch of the changes needed for this purpose.
- Core device runtime power management support code cleanup related
to the execution of callbacks (Andrzej Hajda).
- cpuidle ARM support improvements (Lorenzo Pieralisi).
- cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a
new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and
Bartlomiej Zolnierkiewicz).
- New cpufreq driver callback (->ready) to be executed when the
cpufreq core is ready to use a given policy object and cpufreq-dt
driver modification to use that callback for cooling device
registration (Viresh Kumar).
- cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James
Geboski, Tomeu Vizoso).
- Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate,
cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao,
Stefan Wahren, Petr Cvek).
- OPP (Operating Performance Points) framework modification to allow
OPPs to be removed too and update of a few cpufreq drivers
(cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added
during initialization) on driver removal (Viresh Kumar).
- Hibernation core fixes and cleanups (Tina Ruchandani and Markus
Elfring).
- PM Kconfig fix related to CPU power management (Pankaj Dubey).
- cpupower tool fix (Prarit Bhargava)"
* tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits)
i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c
dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
tools: cpupower: fix return checks for sysfs_get_idlestate_count()
drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME
MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
leds: leds-gpio: Fix multiple instances registration without 'label' property
iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
hwrandom / exynos / PM: Use CONFIG_PM in #ifdef
block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
USB / PM: Drop CONFIG_PM_RUNTIME from the USB core
PM: Merge the SET*_RUNTIME_PM_OPS() macros
...
clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
o Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
o The seq_buf code was changed to model the seq_file code. I have
a patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was done
to make sure that seq_buf.c is compatible with seq_file.c. I may
try to get that patch in for 3.20.
o The seq_buf.c file was moved to lib/ to remove it from being dependent
on CONFIG_TRACING.
o The printk() was updated to allow for a per_cpu "override" of
the internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow the
NMI to change what printk() does in order to call dump_stack() without
needing to update that code as well.
o Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would wait
till all the NMIs finished, and then it would print the seq_buf
data to the console safely from a non NMI context.
Merge tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull nmi-safe seq_buf printk update from Steven Rostedt:
"This code is a fork from the trace-3.19 pull as it needed the
trace_seq clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in
practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
- Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
- The seq_buf code was changed to model the seq_file code. I have a
patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was
done to make sure that seq_buf.c is compatible with seq_file.c. I
may try to get that patch in for 3.20.
- The seq_buf.c file was moved to lib/ to remove it from being
dependent on CONFIG_TRACING.
- The printk() was updated to allow for a per_cpu "override" of the
internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow
the NMI to change what printk() does in order to call dump_stack()
without needing to update that code as well.
- Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would
wait till all the NMIs finished, and then it would print the
seq_buf data to the console safely from a non NMI context
One added bonus is that this code also makes the NMI dump stack work
on PREEMPT_RT kernels. As printk() includes sleeping locks on
PREEMPT_RT, printk() only writes to console if the console does not
use any rt_mutex converted spin locks. Which a lot do"
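
The "descriptor that gets passed around" idea can be sketched in userspace
roughly as follows. This is not the kernel's seq_buf API: struct sbuf and
the sbuf_*() helpers are hypothetical names made up for this example, and
the overflow convention (len set past size) is an assumption chosen to
keep the sketch short.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace analogue of a seq_buf-style descriptor. */
struct sbuf {
	char *buffer;	/* caller-provided storage */
	size_t size;	/* capacity of buffer in bytes */
	size_t len;	/* bytes written so far; size + 1 marks overflow */
};

static void sbuf_init(struct sbuf *s, char *buf, size_t size)
{
	s->buffer = buf;
	s->size = size;
	s->len = 0;
}

/* Non-zero once a write did not fit. */
static int sbuf_has_overflowed(const struct sbuf *s)
{
	return s->len > s->size;
}

/*
 * Append printf()-formatted text. Nothing is returned; callers check
 * sbuf_has_overflowed() afterwards. On overflow the buffer may hold a
 * truncated tail and further writes are ignored.
 */
static void sbuf_printf(struct sbuf *s, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	if (sbuf_has_overflowed(s))
		return;

	room = s->size - s->len;
	va_start(ap, fmt);
	n = vsnprintf(s->buffer + s->len, room, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= room) {
		s->len = s->size + 1;	/* mark overflow */
		return;
	}
	s->len += (size_t)n;
}
```

A caller would set up a buffer with sbuf_init(), hand the descriptor to
any number of functions that append to it with sbuf_printf(), and test
sbuf_has_overflowed() once at the end before printing the result.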
* tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
x86/nmi: Fix use of unallocated cpumask_var_t
printk/percpu: Define printk_func when printk is not defined
x86/nmi: Perform a safe NMI stack trace on all CPUs
printk: Add per_cpu printk func to allow printk to be diverted
seq_buf: Move the seq_buf code to lib/
seq-buf: Make seq_buf_bprintf() conditional on CONFIG_BINARY_PRINTF
tracing: Add seq_buf_get_buf() and seq_buf_commit() helper functions
tracing: Have seq_buf use full buffer
seq_buf: Add seq_buf_can_fit() helper function
tracing: Add paranoid size check in trace_printk_seq()
tracing: Use trace_seq_used() and seq_buf_used() instead of len
tracing: Clean up tracing_fill_pipe_page()
seq_buf: Create seq_buf_used() to find out how much was written
tracing: Add a seq_buf_clear() helper and clear len and readpos in init
tracing: Convert seq_buf fields to be like seq_file fields
tracing: Convert seq_buf_path() to be like seq_path()
tracing: Create seq_buf layer in trace_seq
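
The printk diversion described above can be modelled in userspace with a
function pointer that the print entry point calls through. This is only a
sketch under simplifying assumptions: it uses one global pointer where the
kernel change uses a per-CPU one, and demo_printk(), print_default(),
print_to_buffer() and nmi_buf are hypothetical names for this example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef int (*print_func_t)(const char *fmt, va_list args);

/* Buffer that collects output while printing is diverted. */
static char nmi_buf[128];
static size_t nmi_len;

/* Normal path: write straight to the "console" (stderr here). */
static int print_default(const char *fmt, va_list args)
{
	return vfprintf(stderr, fmt, args);
}

/* Diverted path: append to nmi_buf, truncating if it does not fit. */
static int print_to_buffer(const char *fmt, va_list args)
{
	size_t room = sizeof(nmi_buf) - nmi_len;	/* always >= 1 */
	int n = vsnprintf(nmi_buf + nmi_len, room, fmt, args);

	if (n < 0)
		return n;
	if ((size_t)n >= room)
		n = (int)(room - 1);	/* only room - 1 characters were stored */
	nmi_len += (size_t)n;
	return n;
}

/* Assumption: a single global hook; the kernel keeps one per CPU. */
static print_func_t current_print = print_default;

/* Entry point: callers never need to know where the text ends up. */
static int demo_printk(const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = current_print(fmt, args);
	va_end(args);
	return ret;
}
```

Code that calls demo_printk() does not change when the hook is swapped,
which mirrors the reason given above for not having to touch dump_stack().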
to the trace_seq code. It also removed the return values from the
trace_seq_*() functions and uses trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to the
seq_file code as well in another tree.
Some of the other goodies include:
o Added some "!" (NOT) logic to the tracing filter.
o Fixed the frame pointer logic to the x86_64 mcount trampolines
o Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated
and be called directly by functions that only have a single hook
to them.
Merge tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values from the
trace_seq_*() functions and uses trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
Merge first patchbomb from Andrew Morton:
- a few minor cifs fixes
- dma-debug updates
- ocfs2
- slab
- about half of MM
- procfs
- kernel/exit.c
- panic.c tweaks
- printk updates
- lib/ updates
- checkpatch updates
- fs/binfmt updates
- the drivers/rtc tree
- nilfs
- kmod fixes
- more kernel/exit.c
- various other misc tweaks and fixes
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
exit: pidns: fix/update the comments in zap_pid_ns_processes()
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
exit: exit_notify: re-use "dead" list to autoreap current
exit: reparent: call forget_original_parent() under tasklist_lock
exit: reparent: avoid find_new_reaper() if no children
exit: reparent: introduce find_alive_thread()
exit: reparent: introduce find_child_reaper()
exit: reparent: document the ->has_child_subreaper checks
exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
exit: proc: don't try to flush /proc/tgid/task/tgid
exit: release_task: fix the comment about group leader accounting
exit: wait: drop tasklist_lock before psig->c* accounting
exit: wait: don't use zombie->real_parent
exit: wait: cleanup the ptrace_reparented() checks
usermodehelper: kill the kmod_thread_locker logic
usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
...
The comments in zap_pid_ns_processes() are not clear, we need to explain
how this code actually works.
1. "Ignore SIGCHLD" looks like optimization but it is not, we also
need this for correctness.
2. The comment above sys_wait4() could tell more.
EXIT_ZOMBIE child is only possible if it has exited before we
ignored SIGCHLD. Or if it is traced from the parent namespace,
but in this case it will be reaped by debugger after detach,
sys_wait4() acts as a synchronization point.
3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
outdated. Contrary to what it says we do not need to make sure
they all go away after 0a01f2cc39 "pidns: Make the pidns proc
mount/umount logic obvious".
At the same time, we do need to wait for nr_hashed==init_pids,
but the reasons are quite different and not obvious: setns().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
fails because disable_pid_allocation() was called by the exiting
child_reaper.
We could simply move get_pid_ns() down to successful return, but this fix
tries to be as trivial as possible.
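
The shape of the leak and of the fix can be modelled in userspace as
below. This is an illustrative sketch, not the kernel code: struct ns,
ns_get(), ns_put() and alloc_in_ns() are hypothetical stand-ins, and the
allocation_disabled flag merely models disable_pid_allocation() having
been called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted namespace. */
struct ns {
	int refcount;
	bool allocation_disabled;	/* models disable_pid_allocation() */
};

static void ns_get(struct ns *ns)
{
	ns->refcount++;
}

static void ns_put(struct ns *ns)
{
	ns->refcount--;
}

/* Returns 0 on success (reference kept) or -1 on failure (reference dropped). */
static int alloc_in_ns(struct ns *ns)
{
	ns_get(ns);	/* reference taken up front, before the failure check */

	if (ns->allocation_disabled) {
		ns_put(ns);	/* the put the buggy error path was missing */
		return -1;
	}

	return 0;
}
```

Without the ns_put() in the error branch, every failed allocation would
leave the count one too high, so the namespace could never be freed.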
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the previous change we can add just the exiting EXIT_DEAD task to
the "dead" list and remove another release_task(tsk).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shift "release dead children" loop from forget_original_parent() to its
caller, exit_notify(). It is safe to reap them even if our parent reaps
us right after we drop tasklist_lock, those children no longer have any
connection to the exiting task.
And this allows us to avoid write_lock_irq(tasklist_lock) right after it
was released by forget_original_parent(), we can simply call it with
tasklist_lock held.
While at it, move the comment about forget_original_parent() up to
this function.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that pid_ns logic was isolated we can change forget_original_parent()
to return right after find_child_reaper() when father->children is empty,
there is nothing to reparent in this case.
In particular this avoids find_alive_thread() and this can help if the
whole process exits and it has a lot of PF_EXITING threads at the start of
the thread list, this can easily lead to O(nr_threads ** 2) iterations.
Trivial test case (tested under KVM, 2 CPUs):
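#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Each thread simply sleeps until the process is killed. */
static void *tfunc(void *arg)
{
	pause();
	return NULL;
}

/* Create nt sleeping threads, then kill the whole process with SIGTRAP. */
static int child(unsigned int nt)
{
	pthread_t pt;

	while (nt--)
		assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);

	pthread_kill(pt, SIGTRAP);
	pause();
	return 0;
}

/* Usage: ./test <number of forks> <threads per child> */
int main(int argc, const char *argv[])
{
	int stat;
	unsigned int nf = atoi(argv[1]);
	unsigned int nt = atoi(argv[2]);

	while (nf--) {
		if (!fork())
			return child(nt);

		wait(&stat);
		assert(stat == SIGTRAP);
	}

	return 0;
}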
$ time ./test 16 16536 shows:
	        real            user            sys
	-       5m37.628s       0m4.437s        8m5.560s
	+       0m50.032s       0m7.130s        1m4.927s
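The quadratic behaviour described above can be illustrated with a small
userspace model. This is only a sketch under assumptions: struct
model_thread, scan_for_alive() and total_scan_steps() are hypothetical
names, not kernel code. The model counts how many list entries are
inspected when threads exit in list order and each one scans from the
start of the list for a thread that is not yet exiting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread list entry; not a kernel structure. */
struct model_thread {
	int exiting;	/* models the PF_EXITING flag */
};

/*
 * Scan from the start of the list for the first thread that is not
 * exiting. Returns the number of entries inspected.
 */
size_t scan_for_alive(const struct model_thread *t, size_t n)
{
	size_t steps = 0;

	assert(t);
	for (size_t i = 0; i < n; i++) {
		steps++;
		if (!t[i].exiting)
			break;
	}
	return steps;
}

/*
 * Threads exit in list order; each one marks itself as exiting and then
 * scans for a live thread. Returns the total number of entries inspected,
 * which grows roughly as n * n / 2.
 */
size_t total_scan_steps(struct model_thread *t, size_t n)
{
	size_t total = 0;

	assert(t);
	for (size_t i = 0; i < n; i++) {
		t[i].exiting = 1;
		total += scan_for_alive(t, n);
	}
	return total;
}
```

Returning early when there are no children to reparent skips this scan
entirely, which is where the saving in the numbers above would come from.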
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the new simple helper to factor out the for_each_thread() code in
find_child_reaper() and find_new_reaper(). It can also simplify the
potential PF_EXITING -> exit_state change, plus perhaps we can change this
code to take SIGNAL_GROUP_EXIT into account.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() does 2 completely different things. Not only does it find
a reaper, it also updates pid_ns->child_reaper or kills the whole namespace
if the caller is ->child_reaper.
Now that has_child_subreaper logic doesn't depend on child_reaper check we
can move that pid_ns code into a separate helper. IMHO this makes the
code cleaner, and this allows the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap the "init_task" and same_thread_group() checks. This way it is
simpler to document these checks and we can remove the link to the previous
discussion on lkml.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change find_new_reaper() to use for_each_thread() instead of deprecated
while_each_thread(). We do not bother to check "thread != father" in the
1st loop, we can rely on PF_EXITING check.
Note: this means a minor behavioural change: for_each_thread() starts
from the group leader. But this should be fine, nobody should make any
assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks.
And this can avoid the pointless reparenting to a short-lived thread,
while zombie leaders are not that common.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() assumes that "has_child_subreaper" logic is safe as
long as we are not the exiting ->child_reaper and this is doubly wrong:
1. In fact it is safe if "pid_ns->child_reaper == father"; there must
be no children after zap_pid_ns_processes() returns, so it doesn't
matter what we return in this case and even pid_ns->child_reaper is
wrong otherwise: we can't reparent to ->child_reaper == current.
This is not a bug, but this is confusing.
2. It is not safe if we are not pid_ns->child_reaper but from the same
thread group. We drop tasklist_lock before zap_pid_ns_processes(),
so another thread can lock it and choose the new reaper from the
upper namespace if has_child_subreaper == T, and this is obviously
wrong.
This is not that bad, zap_pid_ns_processes() won't return until the
new reaper reaps all zombies, but this should be fixed anyway.
We could change for_each_thread() loop to use ->exit_state instead of
PF_EXITING which we had to use until 8aac62706a, or we could change
copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
but let's change this code so that it is clear we can't look outside of
our namespace, otherwise same_thread_group(reaper, child_reaper) check
will look wrong and confusing anyway.
We can simply start from "father" and fix the problem. We can't wrongly
return a thread from the same thread group if ->is_child_subreaper == T,
we know that all threads have PF_EXITING set.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
but returns another "reaper" thread which can be dead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Contrary to what the comment in __exit_signal() says we do account the
group leader. Fix this and explain why.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_task_zombie() no longer needs tasklist_lock to accumulate the
psig->c* counters, we can drop it right after cmpxchg(exit_state).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is
correct but needs tasklist_lock, ->real_parent can exit.
We can use "current" instead. This is our natural child, its parent
must be our sub-thread.
2. Read psig/sig outside of ->siglock, ->signal is no longer protected
by this lock.
3. Fix the outdated comments about tasklist_lock. We can not race with
__exit_signal(), the whole thread group is dead, nobody but us can
call it.
Also clarify the usage of ->stats_lock and ->siglock.
Note: thread_group_cputime_adjusted() is sub-optimal in this case, we
probably want to export cputime_adjust() to avoid thread_group_cputime().
The comment says "all threads" but there are no other threads.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that EXIT_DEAD is the terminal state we can kill "int traced"
variable and check "state == EXIT_DEAD" instead to cleanup the code. In
particular, this way it is clear that the check obviously doesn't need
tasklist_lock.
Also fix the type of "unsigned long state", "long" was always wrong
although this doesn't matter because cmpxchg/xchg uses typeof(*ptr).
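The state transition mentioned above can be sketched in userspace with
C11 atomics. This is a model under assumptions, not the kernel code: the
MODEL_EXIT_* values, struct model_task and model_claim_zombie() are
hypothetical, and atomic_compare_exchange_strong() stands in for the
kernel's cmpxchg(). The point is that only one caller can win the
zombie transition, and the winner decides what to do next from the
state it installed rather than from a separate "traced" variable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state values for this sketch; not the kernel's definitions. */
enum model_exit_state {
	MODEL_EXIT_ZOMBIE = 1,
	MODEL_EXIT_DEAD   = 2,
	MODEL_EXIT_TRACE  = 3,
};

struct model_task {
	atomic_int exit_state;
};

/*
 * Try to move a zombie to its next state. Returns the state this caller
 * installed, or 0 if another caller already claimed the task.
 */
int model_claim_zombie(struct model_task *p, bool traced)
{
	int state = traced ? MODEL_EXIT_TRACE : MODEL_EXIT_DEAD;
	int expected = MODEL_EXIT_ZOMBIE;

	assert(p);
	if (!atomic_compare_exchange_strong(&p->exit_state, &expected, state))
		return 0;	/* lost the race, nothing to do */

	return state;
}
```

A caller would then test "state == MODEL_EXIT_DEAD" on the returned value
to decide whether it is the one that has to release the task.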
[akpm@linux-foundation.org: don't make me google the C Operator Precedence table]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we do not call kernel_thread(CLONE_VFORK) from the worker
thread we can not deadlock if do_execve() in turn triggers another
call_usermodehelper(), we can remove the kmod_thread_locker code.
Note: we should probably kill khelper_wq and simply use one of the
global workqueues, say, system_unbound_wq, this special wq for umh buys
nothing nowadays.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After "kernel/kmod: fix use-after-free of the sub_info structure"
CLONE_VFORK in __call_usermodehelper() buys nothing, we rely on
umh_complete() in ____call_usermodehelper() anyway.
Remove it. This also eliminates the unnecessary sleep/wakeup in the
likely case, and this allows the next change.
While at it, kill the "int wait" locals in ____call_usermodehelper() and
__call_usermodehelper(), they can safely use sub_info->wait.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pranith Kumar posted a patch in which he removed the "volatile"
qualifier for the "logbuf_cpu" variable in vprintk_emit().
https://lkml.org/lkml/2014/11/13/894
In his patch, he used ACCESS_ONCE() for all references to
that symbol to provide whatever protection was intended.
There was some discussion that followed, and in the end Steven Rostedt
concluded that not only was "volatile" not needed, neither was it
required to use ACCESS_ONCE(). I offered an elaborate description that
concluded Steven was right, and Pranith asked me to submit an
alternative patch. And this is it.
The basic reason "volatile" is not needed is that "logbuf_cpu" has
static storage duration, and vprintk_emit() is an exported
interface. This means that the value of logbuf_cpu must be read
from memory the first time it is used in a particular call of
vprintk_emit(). The variable's value is read only once in that
function, so when it is read it will be the copy from memory (or cache).
In addition, the value of "logbuf_cpu" is only ever written under
protection of a spinlock. So the value that is read is the "real"
value (and not an out-of-date cached one). If its value is not
UINT_MAX, it is the current CPU's processor id, and it will have
been last written by the running CPU.
Signed-off-by: Alex Elder <elder@linaro.org>
Reported-by: Pranith Kumar <bobby.prani@gmail.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eliminate the unlikely possibility of message interleaving for
early_printk/early_vprintk use.
early_vprintk can be done via the %pV extension so remove this
unnecessary function and change early_printk to have the equivalent
vprintk code.
All uses of early_printk already end with a newline so also remove the
unnecessary newline from the early_printk function.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system. Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to
the user.
A much easier method would be a switch to change the WARN() over to a
panic. This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.
This patch adds a panic_on_warn kernel parameter and a
/proc/sys/kernel/panic_on_warn sysctl. When set, panic() is called in
the warn_slowpath_common() path. The function will still print out the
location of the warning.
An example of the panic_on_warn output:
The first line below is from the WARN_ON() to output the WARN_ON()'s
location. After that the panic() output is displayed.
WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...
CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
Call Trace:
[<ffffffff81665190>] dump_stack+0x46/0x58
[<ffffffff8165e2ec>] panic+0xd0/0x204
[<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
[<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
[<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
[<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
[<ffffffff81002144>] do_one_initcall+0xd4/0x210
[<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
[<ffffffff810f8889>] load_module+0x16a9/0x1b30
[<ffffffff810f3d30>] ? store_uevent+0x70/0x70
[<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
[<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
[<ffffffff8166cf29>] system_call_fastpath+0x12/0x17
Successfully tested by me.
hpa said: There is another very valid use for this: many operators would
rather a machine shuts down than being potentially compromised either
functionally or security-wise.
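The control flow described above (print the warning location first, then
escalate if the switch is set) can be modelled in userspace. This is a
sketch under assumptions: model_panic_on_warn, model_warn() and the
MODEL_* results are hypothetical names, and the function reports what
would have happened instead of actually stopping the machine.

```c
#include <assert.h>
#include <stdio.h>

/* Models the panic_on_warn switch: 0 = only warn, non-zero = escalate. */
int model_panic_on_warn;

enum model_warn_result { MODEL_WARNED, MODEL_PANICKED };

/*
 * Always print the warning location, then escalate if the switch is set.
 * A real panic() does not return; this sketch only reports the outcome.
 */
enum model_warn_result model_warn(const char *file, int line)
{
	assert(file);
	fprintf(stderr, "WARNING: at %s:%d\n", file, line);

	if (model_panic_on_warn) {
		fprintf(stderr,
			"Kernel panic - not syncing: panic_on_warn set ...\n");
		return MODEL_PANICKED;
	}
	return MODEL_WARNED;
}
```

Keeping the location print ahead of the escalation is what preserves the
WARNING line at the top of the example output.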
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
we can simply pass "dead_children" list to exit_ptrace() and remove
another release_task() loop. Plus this way we do not need to drop and
reacquire tasklist_lock.
Also shift the list_empty(ptraced) check, if we want this optimization it
makes sense to eliminate the function call altogether.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Now that reparent_leader() doesn't abuse ->sibling we can shift
list_move_tail() from reparent_leader() to forget_original_parent()
and turn it into a single list_splice_tail_init(). This also makes
BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.
2. This also allows to shift the same_thread_group() check, it looks
a bit more clear in the caller.
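The difference between moving entries one by one and a single splice can
be sketched with a minimal userspace list. This is an illustration under
assumptions: struct model_node and the model_list_*() helpers are
hypothetical re-implementations written for this example, not the
kernel's list.h.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, in the style of list_head. */
struct model_node {
	struct model_node *prev, *next;
};

void model_list_init(struct model_node *head)
{
	head->prev = head->next = head;
}

int model_list_empty(const struct model_node *head)
{
	return head->next == head;
}

void model_list_add_tail(struct model_node *n, struct model_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Append every entry of src to the tail of dst in one step and leave
 * src empty and reusable. No per-entry loop is needed.
 */
void model_list_splice_tail_init(struct model_node *src,
				 struct model_node *dst)
{
	struct model_node *first, *last;

	assert(src && dst && src != dst);
	if (model_list_empty(src))
		return;

	first = src->next;
	last = src->prev;

	first->prev = dst->prev;
	dst->prev->next = first;
	last->next = dst;
	dst->prev = last;

	model_list_init(src);
}

size_t model_list_length(const struct model_node *head)
{
	size_t n = 0;

	for (const struct model_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

Because the source list is reinitialised by the splice itself, a separate
emptiness check on it afterwards has nothing left to catch.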
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
We need to change t->parent if and only if t is not traced.
2. If we actually want this BUG_ON() to ensure that parent/ptrace
match each other, then we should also take ptrace_reparented()
case into account too.
3. Change this code to use for_each_thread() instead of deprecated
while_each_thread().
[dan.carpenter@oracle.com: silence a bogus static checker warning]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD task
into dead_children list we are going to release. This obviously removes
the dead task from its real_parent->children list and this is even good;
the parent can do nothing with the EXIT_DEAD reparented zombie, it only
makes do_wait() slower.
But, this also means that it can not be reparented once again, so if its
new parent dies too nobody will update ->parent/real_parent, they can
point to the freed memory even before release_task() we are going to call,
this breaks the code which relies on pid_alive() to access
->real_parent/parent.
Fortunately this is mostly theoretical, this can only happen if init or
PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
sub-thread exits right after we drop tasklist_lock.
Change this code to use ->ptrace_entry instead, we know that the child is
not traced so nobody can ever use this member. This also allows to unify
this logic with exit_ptrace(), see the next changes.
Note: we really need to change release_task() to nullify real_parent/
parent/group_leader pointers, but we need to change the current users
first somehow. And it would be better to reap this zombie immediately but
release_task_locked() we need is complicated by proc_flush_task().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pid_alive() like other
users do.
Note: we need some helpers to cleanup the code like this. And it seems
that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe either.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All memory accounting and limiting has been switched over to the
lockless page counters. Bye, res_counter!
[akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
[mhocko@suse.cz: ditch the last remnants of res_counter]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:
- unification of d_splice_alias()/d_materialize_unique()
- iov_iter rewrite
- killing a bunch of ->f_path.dentry users (and f_dentry macro).
Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.
Still not complete, but much closer now.
- crapectomy in lustre (dead code removal, mostly)
- "let's make seq_printf return nothing" preparations
- assorted cleanups and fixes
There _definitely_ will be more piles"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...
Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore fixes from Tony Luck:
"On a system that restricts access to dmesg, don't let people side-step
that by reading copies that pstore saved"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
syslog: Provide stub check_syslog_permissions
pstore: Honor dmesg_restrict sysctl on dmesg dumps
pstore/ram: Strip ramoops header for correct decompression
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more 2038 timer work from Thomas Gleixner:
"Two more patches for the ongoing 2038 work:
- New accessors to clock MONOTONIC and REALTIME seconds
This is a separate branch as Arnd has follow-up work depending on
this"
* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME
timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
Pull x86 MPX support from Thomas Gleixner:
"This enables support for x86 MPX.
MPX is a new debug feature for bound checking in user space. It
requires kernel support to handle the bound tables and decode the
bound violating instruction in the trap handler"
* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
asm-generic: Remove asm-generic arch_bprm_mm_init()
mm: Make arch_unmap()/bprm_mm_init() available to all architectures
x86: Cleanly separate use of asm-generic/mm_hooks.h
x86 mpx: Change return type of get_reg_offset()
fs: Do not include mpx.h in exec.c
x86, mpx: Add documentation on Intel MPX
x86, mpx: Cleanup unused bound tables
x86, mpx: On-demand kernel allocation of bounds tables
x86, mpx: Decode MPX instruction to get bound violation information
x86, mpx: Add MPX-specific mmap interface
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: Add MPX to disabled features
ia64: Sync struct siginfo with general version
mips: Sync struct siginfo with general version
mpx: Extend siginfo structure to include bound violation information
x86, mpx: Rename cfg_reg_u and status_reg
x86: mpx: Give bndX registers actual names
x86: Remove arbitrary instruction size limit in instruction decoder
Pull irq domain updates from Thomas Gleixner:
"The real interesting irq updates:
- Support for hierarchical irq domains:
For complex interrupt routing scenarios where more than one
interrupt related chip is involved we had no proper representation
in the generic interrupt infrastructure so far. That made people
implement rather ugly constructs in their nested irq chip
implementations. The main offenders are x86 and arm/gic.
To disentangle that mess we have now hierarchical irqdomains which
separate the various interrupt chips and connect them via the
hierarchical domains. That keeps the domain specific details
internal to the particular hierarchy level and removes the
criss/cross referencing of chip internals. The resulting hierarchy
for a complex x86 system will look like this:
  vector                          mapped: 74
    msi-0                         mapped: 2
    dmar-ir-1                     mapped: 69
      ioapic-1                    mapped: 4
      ioapic-0                    mapped: 20
      pci-msi-2                   mapped: 45
    dmar-ir-0                     mapped: 3
      ioapic-2                    mapped: 1
      pci-msi-1                   mapped: 2
    htirq                         mapped: 0
Neither ioapic nor pci-msi know about the dmar interrupt remapping
between themselves and the vector domain. If interrupt remapping is
disabled ioapic and pci-msi become direct children of the vector
domain.
In hindsight we should have done that years ago, but in hindsight
we always know better :)
- Support for generic MSI interrupt domain handling
We have more and more non PCI related MSI interrupts, so providing
a generic infrastructure for this is better than having all
affected architectures implementing their own private hacks.
- Support for PCI-MSI interrupt domain handling, based on the generic
MSI support.
This part carries the pci/msi branch from Bjorn Helgaas pci tree to
avoid a massive conflict. The PCI/MSI parts are acked by Bjorn.
I have two more branches on top of this. The full conversion of x86
to hierarchical domains and a partial conversion of arm/gic"
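The parent/child relationship described above can be sketched as a
structure that only knows its direct parent. This is an illustration
under assumptions: struct model_domain, model_domain_depth() and
model_domain_root() are hypothetical names and do not reflect the real
irqdomain data structures or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a domain that only knows its direct parent. */
struct model_domain {
	const char *name;
	const struct model_domain *parent;	/* NULL for the root */
};

/* Number of levels between d and the root of its hierarchy. */
size_t model_domain_depth(const struct model_domain *d)
{
	size_t depth = 0;

	assert(d);
	while (d->parent) {
		d = d->parent;
		depth++;
	}
	return depth;
}

/* Walk up the parent links until the root domain is reached. */
const struct model_domain *model_domain_root(const struct model_domain *d)
{
	assert(d);
	while (d->parent)
		d = d->parent;
	return d;
}
```

In this model an ioapic domain whose parent is a remapping domain sits two
levels below the vector domain, and one whose parent is the vector domain
itself sits one level below it. The ioapic entry is the same in both
cases, only the parent link differs.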
* 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
genirq: Move irq_chip_write_msi_msg() helper to core
PCI/MSI: Allow an msi_controller to be associated to an irq domain
PCI/MSI: Provide mechanism to alloc/free MSI/MSIX interrupt from irqdomain
PCI/MSI: Enhance core to support hierarchy irqdomain
PCI/MSI: Move cached entry functions to irq core
genirq: Provide default callbacks for msi_domain_ops
genirq: Introduce msi_domain_alloc/free_irqs()
asm-generic: Add msi.h
genirq: Add generic msi irq domain support
genirq: Introduce callback irq_chip.irq_write_msi_msg
genirq: Work around __irq_set_handler vs stacked domains ordering issues
irqdomain: Introduce helper function irq_domain_add_hierarchy()
irqdomain: Implement a method to automatically call parent domains alloc/free
genirq: Introduce helper irq_domain_set_info() to reduce duplicated code
genirq: Split out flow handler typedefs into seperate header file
genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchip
genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchip
genirq: Add more helper functions to support stacked irq_chip
genirq: Introduce helper functions to support stacked irq_chip
irqdomain: Do irq_find_mapping and set_type for hierarchy irqdomain in case OF
...
Pull irq core updates from Thomas Gleixner:
"This is the first (boring) part of irq updates:
- support for big endian I/O accessors in the generic irq chip
- cleanup of brcmstb/bcm7120 drivers so they can be reused for non
ARM SoCs
- the usual pile of fixes and updates for the various ARM irq chips"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
irqchip: dw-apb-ictl: Add PM support
irqchip: dw-apb-ictl: Enable IRQ_GC_MASK_CACHE_PER_TYPE
irqchip: dw-apb-ictl: Always use use {readl|writel}_relaxed
ARM: orion: convert the irq_reg_{readl,writel} calls to the new API
irqchip: atmel-aic: Add missing entry for rm9200 irq fixups
irqchip: atmel-aic: Rename at91sam9_aic_irq_fixup for naming consistency
irqchip: atmel-aic: Add specific irq fixup function for sam9g45 and sam9rl
irqchip: atmel-aic: Add irq fixups for at91sam926x SoCs
irqchip: atmel-aic: Add irq fixup for RTT block
irqchip: brcmstb-l2: Convert driver to use irq_reg_{readl,writel}
irqchip: bcm7120-l2: Convert driver to use irq_reg_{readl,writel}
irqchip: bcm7120-l2: Decouple driver from brcmstb-l2
irqchip: bcm7120-l2: Extend driver to support 64+ bit controllers
irqchip: bcm7120-l2: Use gc->mask_cache to simplify suspend/resume functions
irqchip: bcm7120-l2: Fix missing nibble in gc->unused mask
irqchip: bcm7120-l2: Make sure all register accesses use base+offset
irqchip: bcm7120-l2, brcmstb-l2: Remove ARM Kconfig dependency
irqchip: bcm7120-l2: Eliminate bad IRQ check
irqchip: brcmstb-l2: Eliminate dependency on ARM code
genirq: Generic chip: Add big endian I/O accessors
...
Pull timer core updates from Thomas Gleixner:
"The time(r) department provides:
- more infrastructure work on the year 2038 issue
- a few fixes in the Armada SoC timers
- the usual pile of fixlets and improvements"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: armada-370-xp: Use the reference clock on A375 SoC
watchdog: orion: Use the reference clock on Armada 375 SoC
clocksource: armada-370-xp: Add missing clock enable
time: Fix sign bug in NTP mult overflow warning
time: Remove timekeeping_inject_sleeptime()
rtc: Update suspend/resume timing to use 64bit time
rtc/lib: Provide y2038 safe rtc_tm_to_time()/rtc_time_to_tm() replacement
time: Fixup comments to reflect usage of timespec64
time: Expose get_monotonic_coarse64() for in-kernel uses
time: Expose getrawmonotonic64 for in-kernel uses
time: Provide y2038 safe mktime() replacement
time: Provide y2038 safe timekeeping_inject_sleeptime() replacement
time: Provide y2038 safe do_settimeofday() replacement
time: Complete NTP adjustment threshold judging conditions
time: Avoid possible NTP adjustment mult overflow.
time: Rename udelay_test.c to test_udelay.c
clocksource: sirf: Remove hard-coded clock rate
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.
This instruments might_sleep() checks to catch places that nest
blocking primitives - such as mutex usage in a wait loop. Such
bugs can result in hard to debug races/hangs.
Another category of invalid nesting that this facility will detect
is the calling of blocking functions from within schedule() ->
sched_submit_work() -> blk_schedule_flush_plug().
There's some potential for false positives (if secondary blocking
primitives themselves are not ready yet for this facility), but the
kernel will warn once about such bugs per bootup, so the warning
isn't much of a nuisance.
This feature comes with a number of fixes, for problems uncovered
with it, so no messages are expected normally.
- Another round of sched/numa optimizations and refinements, for
CONFIG_NUMA_BALANCING=y.
- Another round of sched/dl fixes and refinements.
Plus various smaller fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched: Add missing rcu protection to wake_up_all_idle_cpus
sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
sched/numa: Init numa balancing fields of init_task
sched/deadline: Remove unnecessary definitions in cpudeadline.h
sched/cpupri: Remove unnecessary definitions in cpupri.h
sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
sched/fair: Fix stale overloaded status in the busiest group finding logic
sched: Move p->nr_cpus_allowed check to select_task_rq()
sched/completion: Document when to use wait_for_completion_io_*()
sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
sched/fair: Kill task_struct::numa_entry and numa_group::task_list
sched: Refactor task_struct to use numa_faults instead of numa_* pointers
sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
sched/deadline: Reschedule from switched_from_dl() after a successful pull
sched/deadline: Push task away if the deadline is equal to curr during wakeup
sched/deadline: Add deadline rq status print
sched/deadline: Fix artificial overrun introduced by yield_task_dl()
sched/rt: Clean up check_preempt_equal_prio()
sched/core: Use dl_bw_of() under rcu_read_lock_sched()
sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
...
Pull perf events update from Ingo Molnar:
"On the kernel side there are few changes, the one that stands out is
PEBS machine state sampling support on x86, by Stephane Eranian.
On the tooling side:
User visible tooling changes:
- Don't open the DWARF info multiple times, keeping instead a dwfl
handle in struct dso, greatly speeding up 'perf report' on powerpc.
(Sukadev Bhattiprolu)
- Introduce PARSE_OPT_DISABLED option flag and use it to avoid
showing undesired options in tools that provide frontends to
'perf record', like sched, kvm, etc (Namhyung Kim)
- Fallback to kallsyms when using the minimal 'ELF' loader (Arnaldo
Carvalho de Melo)
- Fix annotation with kcore (Adrian Hunter)
- Support source line numbers in annotate using a hotkey (Andi Kleen)
- Callchain improvements including:
* Enable printing the srcline in the history
* Make get_srcline fall back to sym+offset (Andi Kleen)
- TUI hist_entry browser fixes, including showing missing overhead
value for first level callchain. Detected comparing the output of
--stdio/--gui (that matched) with --tui, that had this problem.
(Namhyung Kim)
- Support handling complete branch stacks as histograms (Andi Kleen)
Tooling infrastructure changes:
- Prep work for supporting per-pkg and snapshot counters in 'perf
stat' (Jiri Olsa)
- 'perf stat' refactorings, moving stuff from it to evsel.c to use in
per-pkg/snapshot format changes (Jiri Olsa)
- Add per-pkg format file parsing (Matt Fleming)
- Clean up libelf feature support code (Namhyung Kim)
- Add gzip decompression support for kernel modules (Namhyung Kim)
- More prep patches for Intel PT, including a thread stack and more
stuff made available via the database export mechanism (Adrian
Hunter)
- More Intel PT work, including a facility to export sample data
(comms, threads, symbol names, etc) in a database friendly way,
with a script to use this to create a postgresql database.
(Adrian Hunter)
- Make sure that thread->mg->machine points to the machine where the
thread exists (it was being set only for the kmaps kernel modules
case, do it as well for the mmaps) and use it to shorten function
signatures (Arnaldo Carvalho de Melo)
... and lots of other fixes and smaller improvements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
perf report: In branch stack mode use address history sorting
perf report: Add --branch-history option
perf callchain: Support handling complete branch stacks as histograms
perf stat: Add support for snapshot counters
perf stat: Add support for per-pkg counters
perf tools: Remove perf_evsel__read interface
perf stat: Use read_counter in read_counter_aggr
perf stat: Make read_counter work over the thread dimension
perf stat: Use perf_evsel__read_cb in read_counter
perf tools: Add snapshot format file parsing
perf tools: Add per-pkg format file parsing
perf evsel: Introduce perf_evsel__read_cb function
perf evsel: Introduce perf_counts_values__scale function
perf evsel: Introduce perf_evsel__compute_deltas function
perf tools: Allow to force redirect pr_debug to stderr.
perf tools: Fix segfault due to invalid kernel dso access
perf callchain: Make get_srcline fall back to sym+offset
perf symbols: Move bfd_demangle stubbing to its only user
perf callchain: Enable printing the srcline in the history
perf tools: Collapse first level callchain entry if it has sibling
...
Pull RCU updates from Ingo Molnar:
"These are the main changes in this cycle:
- Streamline RCU's use of per-CPU variables, shifting from "cpu"
arguments to functions to "this_"-style per-CPU variable
accessors.
- signal-handling RCU updates.
- real-time updates.
- torture-test updates.
- miscellaneous fixes.
- documentation updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
rcu: Fix FIXME in rcu_tasks_kthread()
rcu: More info about potential deadlocks with rcu_read_unlock()
rcu: Optimize cond_resched_rcu_qs()
rcu: Add sparse check for RCU_INIT_POINTER()
documentation: memory-barriers.txt: Correct example for reorderings
documentation: Add atomic_long_t to atomic_ops.txt
documentation: Additional restriction for control dependencies
documentation: Document RCU self test boot params
rcutorture: Fix rcu_torture_cbflood() memory leak
rcutorture: Remove obsolete kversion param in kvm.sh
rcutorture: Remove stale test configurations
rcutorture: Enable RCU self test in configs
rcutorture: Add early boot self tests
torture: Run Linux-kernel binary out of results directory
cpu: Avoid puts_pending overflow
rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
rcu: Remove "cpu" argument to rcu_prepare_for_idle()
rcu: Remove "cpu" argument to rcu_needs_cpu()
rcu: Remove "cpu" argument to rcu_note_context_switch()
rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
...
Generalize id_map_mutex so it can be used for more state of a user namespace.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If you did not create the user namespace and are allowed
to write to uid_map or gid_map you should already have the necessary
privilege in the parent user namespace to establish any mapping
you want so this will not affect userspace in practice.
Limiting unprivileged uid mapping establishment to the creator of the
user namespace makes it easier to verify all credentials obtained with
the uid mapping can be obtained without the uid mapping without
privilege.
Limiting unprivileged gid mapping establishment (which is temporarily
absent) to the creator of the user namespace also ensures that the
combination of uid and gid can already be obtained without privilege.
This is part of the fix for CVE-2014-8989.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
setresuid allows the euid to be set to any of uid, euid, suid, and
fsuid. Therefore it is safe to allow an unprivileged user to map
their euid and use CAP_SETUID privileges with exactly that uid,
as no new credentials can be obtained.
I can not find a combination of existing system calls that allows setting
uid, euid, suid, and fsuid from the fsuid making the previous use
of fsuid for allowing unprivileged mappings a bug.
This is part of a fix for CVE-2014-8989.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
As any gid mapping will allow, and must allow for backwards
compatibility, dropping groups, don't allow any gid mappings to be
established without CAP_SETGID in the parent user namespace.
For a small class of applications this change breaks userspace
and removes useful functionality. This small class of applications
includes tools/testing/selftests/mount/unprivilged-remount-test.c
Most of the removed functionality will be added back with the addition
of a one way knob to disable setgroups. Once setgroups is disabled
setting the gid_map becomes as safe as setting the uid_map.
For more common applications that set the uid_map and the gid_map
with privilege this change will have no effect.
This is part of a fix for CVE-2014-8989.
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.
The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.
This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.
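The shape of such a check can be sketched in user space as follows. This is only an illustration under stated assumptions: struct userns_sketch, its fields and both function names are hypothetical and are not the kernel's data structures or API.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for a user namespace. The field names are
 * assumptions made for this sketch only.
 */
struct userns_sketch {
	unsigned int nr_gid_extents;	/* 0 until a gid mapping is written */
	bool has_cap_setgid;		/* caller holds CAP_SETGID here */
};

/* Helper: has a gid mapping been established for this namespace? */
static bool gid_mapping_established(const struct userns_sketch *ns)
{
	return ns->nr_gid_extents != 0;
}

/*
 * Permission check: the capability alone is not enough, the gid
 * mapping must exist as well, even for setgroups(0, NULL).
 */
static bool may_setgroups_sketch(const struct userns_sketch *ns)
{
	return ns->has_cap_setgid && gid_mapping_established(ns);
}
```

With this shape, a freshly created namespace that has the capability but no gid mapping is refused, which is the case the text above describes.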
Cc: stable@vger.kernel.org
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently, blktrace can be started/stopped via its ioctl-based interface
(used by the userspace blktrace tool) or via its ftrace interface. The
function blk_trace_remove_queue(), called each time an "enable" tunable
of the ftrace interface transitions to zero, removes the trace from the
running list, even if no function from the sysfs interface adds it to
such a list. This leads to a null pointer dereference. This commit
changes the blk_trace_remove_queue() function so that it does not remove
the blk_trace from the running list.
v2:
- Now the patch removes the invocation of list_del() instead of
adding a useless if branch, as suggested by Namhyung Kim.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* pm-domains:
ARM: shmobile: Convert to genpd flags for PM clocks for R-mobile
ARM: shmobile: Convert to genpd flags for PM clocks for r8a7779
PM / Domains: Initial PM clock support for genpd
PM / Domains: Power on the PM domain right after attach completes
PM / Domains: Move struct pm_domain_data to pm_domain.h
PM / Domains: Extract code to power off/on a PM domain
PM / Domains: Make genpd parameter of pm_genpd_present() const
* pm-sleep:
PM / hibernate: Deletion of an unnecessary check before the function call "vfree"
PM / Hibernate: Migrate to ktime_t
* pm-tools:
tools: cpupower: fix return checks for sysfs_get_idlestate_count()
* powercap:
powercap / RAPL: fix build dependency on iosf_mbi
powercap / RAPL: add new model ids
powercap / RAPL: handle atom and core differences
powercap / RAPL: abstract per cpu type functions
* pm-clk:
PM / clock_ops: make __pm_clk_enable more generic
PM / clock_ops: Add pm_clk_add_clk()
* pm-config:
PM: Kconfig: fix unmet dependency for CPU_PM
* pm-opp:
PM / OPP replace kfree_rcu() with call_srcu() in opp_set_availability()
PM / OPP Introduce APIs to remove OPPs
PM / OPP mark OPPs as 'static' or 'dynamic'
PM / OPP don't match for existing OPPs when list is empty
PM / OPP rename 'head' as 'rcu_head' or 'srcu_head' based on its type
When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.
In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.
The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.
The rescuer will be woken again in 100ms to handle another max_requests
requests.
I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests. When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).
With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance. This allows
all requests to be handled in a timely fashion.
We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.
I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread. In that context
it significantly improves performance. A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.
Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.
tj: Inverted the lock order between wq_mayday_lock and pool->lock with
a preceding patch and simplified this patch. Added comment and
updated changelog accordingly. Dongsu spotted missing get_pwq()
in the simplified code.
Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, pool->lock nests inside wq_mayday_lock. There's no inherent
reason for this order. The only place where the two locks are held
together is pool_mayday_timeout() and it just got decided that way.
This nesting order turns out to complicate things with the planned
rescuer_thread() update. Let's invert them. This doesn't cause any
behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Dongsu Park <dongsu.park@profitbricks.com>
Locklessly doing is_idle_task(rq->curr) is only okay because of
RCU protection. The older variant of the broken code checked
rq->curr == rq->idle instead and therefore didn't need RCU.
Fixes: f6be8af1c9 ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Chuansheng Liu <chuansheng.liu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'v3.18' into drm-next
Linux 3.18
Backmerge Linus tree into -next as we had conflicts in i915/radeon/nouveau,
and everyone was solving them individually.
* tag 'v3.18': (57 commits)
Linux 3.18
watchdog: s3c2410_wdt: Fix the mask bit offset for Exynos7
uapi: fix to export linux/vm_sockets.h
i2c: cadence: Set the hardware time-out register to maximum value
i2c: davinci: generate STP always when NACK is received
ahci: disable MSI on SAMSUNG 0xa800 SSD
context_tracking: Restore previous state in schedule_user
slab: fix nodeid bounds check for non-contiguous node IDs
lib/genalloc.c: export devm_gen_pool_create() for modules
mm: fix anon_vma_clone() error treatment
mm: fix swapoff hang after page migration and fork
fat: fix oops on corrupted vfat fs
ipc/sem.c: fully initialize sem_array before making it visible
drivers/input/evdev.c: don't kfree() a vmalloc address
cxgb4: Fill in supported link mode for SFP modules
xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
mm/vmpressure.c: fix race in vmpressure_work_fn()
mm: frontswap: invalidate expired data on a dup-store failure
mm: do not overwrite reserved pages counter at show_mem()
drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
...
Conflicts:
drivers/gpu/drm/i915/intel_display.c
drivers/gpu/drm/nouveau/nouveau_drm.c
drivers/gpu/drm/radeon/radeon_cs.c
No point to expose this to the world. The only legitimate user is the
core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
introduce program type BPF_PROG_TYPE_SOCKET_FILTER that is used
for attaching programs to sockets where ctx == skb.
add verifier checks for ABS/IND instructions which can only be seen
in socket filters, therefore the check:
if (env->prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER)
verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rule is simple. Don't allow anything that wouldn't be allowed
without unprivileged mappings.
It was previously overlooked that establishing gid mappings would
allow dropping groups and potentially gaining permission to files and
directories that had lesser permissions for a specific group than for
all other users.
This is the rule needed to fix CVE-2014-8989 and prevent any other
security issues with new_idmap_permitted.
The reason for this rule is that the unix permission model is old and
there are programs out there somewhere that take advantage of every
little corner of it. So allowing a uid or gid mapping to be
established without privilege that would allow anything that would not
be allowed without that mapping will result in expectations from some
code somewhere being violated. Violated expectations about the
behavior of the OS is a long way to say a security issue.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today there are 3 instances of setgroups and due to an oversight their
permission checking has diverged. Add a common function so that
they may all share the same permission checking code.
This corrects the current oversight in the current permission checks
and adds a helper to avoid this in the future.
A user namespace security fix will update this new helper, shortly.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We've lost the +1 required for correct timeouts in
commit 5ed0bdf21a
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Jul 16 21:05:06 2014 +0000
drm: i915: Use nsec based interfaces
Use ktime_get_raw_ns() and get rid of the back and forth timespec
conversions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: John Stultz <john.stultz@linaro.org>
So fix this up by reinstating our handrolled _timeout function. While
at it bother with handling MAX_JIFFIES.
v2: Convert to usecs (we don't care about the accuracy anyway) first
to avoid overflow issues Dave Gordon spotted.
v3: Drop the explicit MAX_JIFFY_OFFSET check, usecs_to_jiffies should
take care of that already. It might be a bit too enthusiastic about it
though.
v4: Chris has a much nicer color, so use his implementation.
This requires exporting nsec_to_jiffies from time.c.
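Why the +1 matters can be sketched in user space. This is a rough illustration, not the driver's code: SKETCH_HZ and timeout_to_ticks() are hypothetical names, and the tick rate is an assumed value. The conversion rounds down to whole ticks, and since the current tick is already partly elapsed, one extra tick is added so the wait is never shorter than requested.

```c
#include <stdint.h>

#define SKETCH_HZ		250ULL		/* assumed tick rate */
#define SKETCH_NSEC_PER_SEC	1000000000ULL

/*
 * Convert a timeout in nanoseconds to a tick count. The division
 * truncates, and the tick in progress may end at any moment, so add
 * one tick to guarantee at least the requested duration.
 */
static uint64_t timeout_to_ticks(uint64_t timeout_ns)
{
	uint64_t ticks = timeout_ns / (SKETCH_NSEC_PER_SEC / SKETCH_HZ);

	return ticks + 1;
}
```

Without the extra tick, a timeout shorter than one tick period would convert to zero ticks and expire immediately.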
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82749
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.
That way struct proc_ns becomes invisible outside of fs/proc/*.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for now - just move corresponding ->proc_inum instances over there
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rescuer_thread() caches &rescuer->scheduled in a local variable
scheduled for convenience. There's one WARN_ON_ONCE() which was using
&rescuer->scheduled directly. Replace it with the local variable.
This patch causes no functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context. This causes RCU
warnings and possible failures.
This is intended to be a minimal fix suitable for 3.18.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so quite a few
checks for CONFIG_PM_RUNTIME may now depend on CONFIG_PM or even be
dropped entirely in some cases.
Replace CONFIG_PM_RUNTIME with CONFIG_PM in the PM core code.
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The initial reason for this patch is that I noticed that:
if (len > TRACE_BUF_SIZE)
is off by one. In this code, if len == TRACE_BUF_SIZE, then it means we
have truncated the last character off the output string. If we truncate
two or more characters then we exit without printing.
After some discussion, we decided that printing truncated data is better
than not printing at all so we should just use vscnprintf() and remove
the test entirely. Also I have updated memcpy() to copy the NUL char
instead of setting the NUL in a separate step.
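The difference between the two return conventions can be sketched in user space. This is a minimal illustration, not the kernel code: bounded_printf() is a hypothetical helper that mimics the behaviour relied on above (return the number of characters actually stored), built on the standard vsnprintf(), which instead returns the length the output would have had.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical user-space stand-in for a vscnprintf()-style call:
 * returns the number of characters actually stored in buf, excluding
 * the trailing NUL, so the result is always smaller than size.
 */
static int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int len;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	len = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (len < 0)
		return 0;
	/* vsnprintf() reports the untruncated length, so clamp it. */
	if ((size_t)len >= size)
		return (int)(size - 1);
	return len;
}
```

Because the returned length can never reach the buffer size, a caller can copy len + 1 bytes to include the NUL and needs no separate length test: truncated output is simply printed truncated.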
Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, function graph tracer prints "!" or "+" just before
function execution time to signal a function overhead, depending
on the time. And some tracers tracing latency also print "!" or
"+" just after time to signal overhead, depending on the interval
between events. Even though that is usually enough, we sometimes
need to signal execution times larger than 100 microseconds.
For example, I used function graph tracer to detect if there is
any case that exit_mm() takes too much time. I did the following steps
in /sys/kernel/debug/tracing. It was easier to detect very large
execution time with the patched kernel than with the original kernel.
$ echo exit_mm > set_graph_function
$ echo function_graph > current_tracer
$ echo > trace
$ cat trace_pipe > $LOGFILE
... (do something and terminate logging)
$ grep "\\$" $LOGFILE
3) $ 22082032 us | } /* kernel_map_pages */
3) $ 22082040 us | } /* free_pages_prepare */
3) $ 22082113 us | } /* free_hot_cold_page */
3) $ 22083455 us | } /* free_hot_cold_page_list */
3) $ 22083895 us | } /* release_pages */
3) $ 22177873 us | } /* free_pages_and_swap_cache */
3) $ 22178929 us | } /* unmap_single_vma */
3) $ 22198885 us | } /* unmap_vmas */
3) $ 22206949 us | } /* exit_mmap */
3) $ 22207659 us | } /* mmput */
3) $ 22207793 us | } /* exit_mm */
And then, it was easy to find out that a schedule-out occurred by
sub_preempt_count() within kernel_map_pages().
To detect very large function execution time caused by either problematic
function implementation or scheduling issues, this patch can be useful.
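The idea of mapping a duration to a marker character can be sketched as below. The thresholds are assumptions chosen for illustration (the text above only fixes that "+" and "!" exist, that more than 100 microseconds needs its own signal, and that "$" showed up for durations of roughly 22 seconds); duration_mark() is a hypothetical name, not the tracer's function.

```c
/*
 * Pick a marker for an execution time given in microseconds.
 * All thresholds below are illustrative assumptions.
 */
static char duration_mark(unsigned long long usecs)
{
	if (usecs > 1000000ULL)		/* assumed: more than one second */
		return '$';
	if (usecs > 100ULL)		/* assumed: more than 100 us */
		return '!';
	if (usecs > 10ULL)		/* assumed: more than 10 us */
		return '+';
	return ' ';			/* nothing noteworthy */
}
```

A distinct marker for the largest bucket is what makes a simple grep for "$" sufficient to pull the outliers out of a long trace.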
Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support to allow not "!" for and (&&) and or (||). That is:
!(field1 == X && field2 == Y)
Where the value of the full clause will be notted.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ted noticed that he could not filter on an event for a bit being cleared.
That's because the filtering logic only tests event fields with a limited
number of comparisons which, for bit logic, only include "&", which can
test if a bit is set, but there's no good way to see if a bit is clear.
This adds a way to do: !(field & 2048)
Which returns true if the bit is not set, and false otherwise.
Note, currently !(field1 == 10 && field2 == 15) is not supported.
That is, the 'not' only works for direct comparisons, not for the
AND and OR logic.
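A negated bit test of this kind can be sketched as a predicate carrying a "not" flag. This is a user-space illustration only; struct bit_pred and bit_pred_match() are hypothetical names and do not reflect the filter code's actual structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: a bitwise-AND test with an optional "not". */
struct bit_pred {
	uint64_t mask;		/* bits to test, e.g. 2048 */
	bool negate;		/* true for the "!(field & mask)" form */
};

/* Evaluate the predicate against one event field value. */
static bool bit_pred_match(const struct bit_pred *pred, uint64_t field)
{
	bool match = (field & pred->mask) != 0;

	return pred->negate ? !match : match;
}
```

With mask 2048 and negate set, the predicate is true exactly when bit 11 of the field is clear, which is the case that could not be expressed with "&" alone.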
Link: http://lkml.kernel.org/r/20141202021912.GA29096@thunk.org
Link: http://lkml.kernel.org/r/20141202120430.71979060@gandalf.local.home
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In commit 6067dc5a8c ("time: Avoid possible NTP adjustment
mult overflow") a new check was added to watch for adjustments
that could cause a mult overflow.
Unfortunately the check compares a signed with unsigned value
and ignored the case where the adjustment was negative, which
causes spurious warn-ons on some systems (and seems like it
would result in problematic time adjustments there as well, due
to the early return).
Thus this patch adds a check to make sure the adjustment is
positive before we check for an overflow, and resolves the issue
in my testing.
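The pitfall can be reproduced in a few lines of user-space C. This is a sketch of the general signed/unsigned comparison problem, not the timekeeping code itself; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy form: the signed adjustment is converted to unsigned for the
 * addition and the comparison, so a negative value looks like a huge
 * positive one and the wrap-around test fires spuriously.
 */
static bool would_overflow_buggy(uint32_t mult, int32_t mult_adj)
{
	return mult + (uint32_t)mult_adj < (uint32_t)mult_adj;
}

/* Fixed form: only a positive adjustment can overflow the sum. */
static bool would_overflow_fixed(uint32_t mult, int32_t mult_adj)
{
	return mult_adj > 0 &&
	       mult + (uint32_t)mult_adj < (uint32_t)mult_adj;
}
```

For example, with mult = 100 and an adjustment of -1 the buggy form reports an overflow, while the fixed form does not; a genuine wrap such as mult = 0xFFFFFFFF with an adjustment of 1 is still caught.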
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Debugged-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1416890145-30048-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns. I suspect that this is a mistake and that
the code only works because int3 is paranoid.
Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris bisected a NULL pointer dereference in task_sched_runtime() to
commit 6e998916df 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency'.
Chris observed crashes in atop or other /proc walking programs when he
started fork bombs on his machine. He assumed that this is a new exit
race, but that does not make any sense when looking at that commit.
What's interesting is that, the commit provides update_curr callbacks
for all scheduling classes except stop_task and idle_task.
While nothing can ever hit that via the clock_nanosleep() and
clock_gettime() interfaces, which have been the target of the commit in
question, the author obviously forgot that there are other code paths
which invoke task_sched_runtime()
do_task_stat()
thread_group_cputime_adjusted()
thread_group_cputime()
task_cputime()
task_sched_runtime()
if (task_current(rq, p) && task_on_rq_queued(p)) {
update_rq_clock(rq);
p->sched_class->update_curr(rq);
}
If the stats are read for a stomp machine task, aka 'migration/N' and
that task is current on its cpu, this will happily call the NULL pointer
of stop_task->update_curr. Ooops.
Chris' observation that this happens faster when he runs the fork bomb
makes sense as the fork bomb will kick migration threads more often so
the probability to hit the issue will increase.
Add the missing update_curr callbacks to the scheduler classes stop_task
and idle_task. While idle tasks cannot be monitored via /proc we have
other means to hit the idle case.
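The pattern behind the fix can be sketched in user space: when a callback is invoked unconditionally through an ops table, every table has to provide it, even if only as an empty function. All names below are hypothetical, and the assumption that the added callbacks do nothing is made for this sketch only.

```c
#include <stddef.h>

struct rq_sketch {
	unsigned long updates;		/* counts real accounting updates */
};

/* Hypothetical ops table with a single callback. */
struct class_sketch {
	void (*update_curr)(struct rq_sketch *rq);
};

static void update_curr_accounting(struct rq_sketch *rq)
{
	rq->updates++;
}

/* Assumed no-op for classes that have nothing to account. */
static void update_curr_noop(struct rq_sketch *rq)
{
	(void)rq;
}

static const struct class_sketch fair_sketch = {
	.update_curr = update_curr_accounting,
};

static const struct class_sketch stop_sketch = {
	.update_curr = update_curr_noop,
};

/* The caller does not test for NULL, so no class may leave it unset. */
static void read_runtime_sketch(const struct class_sketch *class,
				struct rq_sketch *rq)
{
	class->update_curr(rq);
}
```

The alternative would be a NULL test at every call site; supplying an empty callback keeps the caller unconditional and makes a newly added code path safe by default.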
Fixes: 6e998916df 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
Reported-by: Chris Mason <clm@fb.com>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Required to support non PCI based MSI.
[ tglx: Extracted from Jiangs patch series ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement the basic functions for MSI interrupt support with
hierarchical interrupt domains.
[ tglx: Extracted and combined from several patches ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the introduction of stacked domains, we have the issue that,
depending on where in the stack this is called, __irq_set_handler
will succeed or fail: If this is called from the inner irqchip,
__irq_set_handler() will fail, as it will look at the outer domain
as the (desc->irq_data.chip == &no_irq_chip) test fails (we haven't
set the top level yet).
This patch implements the following: "If there is at least one
valid irqchip in the domain, it will probably sort itself out".
This is clearly not ideal, but it is far less confusing than
crashing because the top-level domain is not up yet.
[ tglx: Added comment and a protection against chained interrupts in
that context ]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1416048553-29289-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce helper function irq_domain_add_hierarchy(), which creates
a linear irqdomain if parameter 'size' is not zero, otherwise creates
a tree irqdomain.
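The selection rule described above amounts to the following, shown as a user-space sketch. The enum and function name are hypothetical and only model the decision, not the helper's real signature or return type.

```c
/* Hypothetical labels for the two kinds of domain. */
enum domain_kind_sketch {
	DOMAIN_LINEAR_SKETCH,	/* fixed-size, table-backed mapping */
	DOMAIN_TREE_SKETCH,	/* sparse, tree-backed mapping */
};

/* A non-zero size selects a linear domain, zero selects a tree. */
static enum domain_kind_sketch pick_domain_kind(unsigned int size)
{
	return size ? DOMAIN_LINEAR_SKETCH : DOMAIN_TREE_SKETCH;
}
```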
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/1416061447-9472-5-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a flag to irq_domain.flags to control whether the irqdomain core
should automatically call parent irqdomain's alloc/free callbacks. It
helps to reduce hierarchy irqdomain users' code size.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/1416061447-9472-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add IRQ_SET_MASK_OK_DONE in addition to IRQ_SET_MASK_OK and
IRQ_SET_MASK_OK_NOCOPY to support stacked irqchips. To the irq core,
IRQ_SET_MASK_OK_DONE is the same as IRQ_SET_MASK_OK. To a stacked
irqchip, it means that the ascendant irqchips have done all the work and
no more handling is needed in the descendant irqchips.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add callback irq_compose_msi_msg to struct irq_chip, which will be used
to support stacked irqchip.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that hierarchy irq_data is supported, introduce several helpers
to support stacked irq_chips.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It is possible to call irq_create_of_mapping() to create/translate the
same IRQ from DT multiple times. Perform the irq_find_mapping() check
and set_type for hierarchy irqdomains in irq_create_of_mapping() to
avoid duplicating this functionality in every outermost irqdomain.
Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We plan to use hierarchy irqdomains to support CPU vector assignment,
the interrupt remapping controller, the IO-APIC controller, MSI
interrupts, hypertransport interrupts etc. on x86 platforms. So extend
the irqdomain interfaces to support hierarchy irqdomains.
There are already many clients of current irqdomain interfaces.
To minimize the changes, we choose to introduce new version 2 interfaces
to support hierarchy instead of extending existing irqdomain interfaces.
According to Thomas's suggestion, the most important design decision is
to build hierarchy struct irq_data to support hierarchy irqdomain, so
hierarchy irqdomain related data could be saved in struct irq_data.
With support of hierarchy irq_data, we could also support stacked
irq_chips. This is most useful in case of set_affinity().
The new hierarchy irqdomain introduces the following interfaces:
1) irq_domain_alloc_irqs()/irq_domain_free_irqs(): allocate/release IRQ
and related resources.
2) __irq_domain_alloc_irqs(): a special version to support legacy IRQs.
3) irq_domain_activate_irq()/irq_domain_deactivate_irq(): program
interrupt controllers to activate/deactivate interrupt.
There are also several helper functions to ease irqdomain implementations:
1) irq_domain_get_irq_data(): get irq_data associated with a specific
irqdomain.
2) irq_domain_set_hwirq_and_chip(): save irqdomain specific data into
irq_data.
3) irq_domain_alloc_irqs_parent()/irq_domain_free_irqs_parent(): invoke
parent irqdomain's alloc/free callbacks.
We also changed irq_startup()/irq_shutdown() to invoke
irq_domain_activate_irq()/irq_domain_deactivate_irq() to program
the interrupt controller when starting/stopping interrupts.
[ tglx: Folded parts of the later patch series in ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add a build fix from Stephen Rothwell into the merge, in openvswitch:
the logging macros take a new initial 'log' argument, and a new call was
added in 'net', so when we merge that in here we have to explicitly add
the new 'log' arg to it or else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: two NUMA fixes, two cputime fixes and an RCU/lockdep fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
sched/cputime: Fix cpu_timer_sample_group() double accounting
sched/numa: Avoid selecting oneself as swap target
sched/numa: Fix out of bounds read in sched_init_numa()
sched: Remove lockdep check in sched_move_task()
Pull perf fixes from Ingo Molnar:
"Misc fixes: two Intel uncore driver fixes, a CPU-hotplug fix and a
build dependencies fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix boot crash on SBOX PMU on Haswell-EP
perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
perf: Fix corruption of sibling list with hotplug
perf/x86: Fix embarrasing typo
Fix up a few comments that weren't updated when the
functions were converted to use timespec64 structures.
Acked-by: Arnd Bergmann <arnd.bergmann@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Adds a timespec64 based get_monotonic_coarse64() implementation
that can be used as we convert internal users of
get_monotonic_coarse away from using timespecs.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Adds a timespec64 based getrawmonotonic64() implementation
that can be used as we convert internal users of
getrawmonotonic away from using timespecs.
Signed-off-by: John Stultz <john.stultz@linaro.org>
As part of addressing the "y2038 problem" for in-kernel uses, this
patch adds a safe mktime64() using time64_t.
After this patch, mktime() is deprecated and all its call sites
will be fixed using mktime64(); after that it can be removed.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
As part of addressing the "y2038 problem" for in-kernel uses, this
patch adds timekeeping_inject_sleeptime64() using timespec64.
After this patch, timekeeping_inject_sleeptime() is deprecated
and all its call sites will be fixed using the new interface;
after that it can be removed.
NOTE: timekeeping_inject_sleeptime() is actually safe, but we
want to eliminate timespec eventually, hence this patch.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The kernel uses a 32-bit signed value (time_t) for the seconds elapsed
since 1970-01-01 00:00:00 UTC, thus it will overflow at 2038-01-19
03:14:08 UTC on 32-bit systems. This is widely known as the y2038 problem.
As part of addressing the "y2038 problem" for in-kernel uses, this patch
adds a safe do_settimeofday64() using timespec64.
After this patch, do_settimeofday() is deprecated and all its call
sites will be fixed using do_settimeofday64(); after that it can be
removed.
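As an illustration of the limit described above, here is a minimal
userspace sketch (not the kernel code). The helper name
fits_in_time32() is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: can a 64-bit seconds-since-epoch value be stored
 * in a signed 32-bit time_t without overflowing?
 */
static bool fits_in_time32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}
```

INT32_MAX (2147483647) seconds after the epoch is 2038-01-19 03:14:07
UTC, so the next second is the first one a signed 32-bit counter cannot
represent.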
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The clocksource mult-adjustment threshold is [mult-maxadj, mult+maxadj],
timekeeping_adjust() only deals with the upper threshold, but misses the
lower threshold.
This patch adds the check for the lower threshold.
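A two-sided range check of this kind could look roughly like the sketch
below. It is a standalone illustration, not the timekeeping_adjust()
code; the function name mult_within_limits() and its parameters are
assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: is new_mult inside [mult - maxadj, mult + maxadj]?
 * The bounds are computed in 64 bits so that neither end can wrap around
 * the 32-bit range.
 */
static bool mult_within_limits(uint32_t new_mult, uint32_t mult,
			       uint32_t maxadj)
{
	int64_t lo = (int64_t)mult - (int64_t)maxadj;
	int64_t hi = (int64_t)mult + (int64_t)maxadj;

	return (int64_t)new_mult >= lo && (int64_t)new_mult <= hi;
}
```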
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Minor fix for > 80 char line]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Ideally, __clocksource_updatefreq_scale() selects the largest shift
value possible for a clocksource. This results in the mult member of
struct clocksource being particularly large, although not so large
that NTP would adjust the clock to cause it to overflow.
That said, nothing actually prohibits an overflow from occurring; it's
just that it "shouldn't" occur.
So while very unlikely, and so far never observed, the value of
(cs->mult+cs->maxadj) may have a chance to reach very near 0xFFFFFFFF,
so there is a possibility it may overflow when doing a positive NTP
adjustment.
In detail: when NTP slews the clock, the kernel goes
through update_wall_time()->...->timekeeping_apply_adjustment():
tk->tkr.mult += mult_adj;
Since there is no guard against it, it's possible tk->tkr.mult may
overflow during this operation.
This patch avoids any possible mult overflow by checking for the overflow
case before adding mult_adj to mult, and also adds a WARNING message
when such a case is detected.
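The idea of checking before adding can be sketched in plain C as
follows. This is only an illustration under assumed names
(mult_adj_would_overflow() and apply_mult_adj() are hypothetical), and
it leaves out the WARNING that the patch emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would mult + mult_adj wrap past 0xFFFFFFFF? Only a positive adj can. */
static bool mult_adj_would_overflow(uint32_t mult, int32_t mult_adj)
{
	return mult_adj > 0 && mult > UINT32_MAX - (uint32_t)mult_adj;
}

/*
 * Apply the adjustment only when it cannot overflow. Returns false and
 * leaves *mult untouched otherwise. Underflow for negative adjustments
 * is not checked in this sketch.
 */
static bool apply_mult_adj(uint32_t *mult, int32_t mult_adj)
{
	if (mult_adj_would_overflow(*mult, mult_adj))
		return false;
	/* Unsigned (modular) addition also covers negative adjustments. */
	*mult += (uint32_t)mult_adj;
	return true;
}
```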
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Reworded commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Kees requested that this test module be renamed for consistency's sake,
so this patch renames the udelay_test.c file (recently added to
tip/timers/core for 3.17) to test_udelay.c
Cc: Kees Cook <keescook@chromium.org>
Cc: Greg KH <greg@kroah.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linux-Next <linux-next@vger.kernel.org>
Cc: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
ftrace users who may modify regs->ip to change the execution
path. If two or more users modify the regs->ip on the same
function entry, one of them will be broken. So they must add the
IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.
Note that ftrace doesn't allow an ftrace_ops which has the IPMODIFY
flag to have a notrace hash, and the ftrace_ops must have a
filter hash (so that the ftrace_ops can hook only specific
entries), because it strongly depends on the address and
must be allowed for only a few selected functions.
Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ fixed up some of the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To avoid include hell, the per_cpu variable printk_func was declared
in percpu.h. But it is only defined if printk is defined.
As users of printk may also use the printk_func variable, it needs to
be defined even if CONFIG_PRINTK is not.
Also add a printk.h include in percpu.h just to be safe.
Link: http://lkml.kernel.org/r/20141121183215.01ba539c@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix up a few typos in comments and convert an int into a bool in
update_traceon_count().
Link: http://lkml.kernel.org/r/546DD445.5080108@hitachi.com
Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Being able to divert printk to call another function besides the normal
logging is useful for things like NMI handling. If a function that
calls printk() is called from NMI context, it is possible to lock up
the box if the NMI handler triggers while another printk is happening.
One example of this use is to perform a stack trace on all CPUs via NMI.
But if the NMI is to do the printk() it can cause the system to lock up.
By allowing the printk to be diverted to another function that can safely
record the printk output and then print it when it is in a safe context,
NMIs will be safe to call functions like show_regs().
Link: http://lkml.kernel.org/p/20140619213952.209176403@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_buf functions are rather useful outside of tracing. Instead
of having it be dependent on CONFIG_TRACING, move the code into lib/
and allow other users to have access to it even when tracing is not
configured.
The seq_buf utility is similar to the seq_file utility, but instead of
sending data back up to userland, it writes it into a buffer
defined at seq_buf_init(). This allows us to send a descriptor around
that writes printf() formatted strings into it that can be retrieved
later.
It is currently used by the tracing facility for things like trace
events, to convert the binary data saved in the ring buffer into a
human-readable ASCII format to be displayed in /sys/kernel/debug/trace.
It can also be used for doing NMI prints safely from NMI context into
the seq_buf and retrieved later and dumped to printk() safely. Doing
printk() from an NMI context is dangerous because an NMI can preempt
a current printk() and deadlock on it.
Link: http://lkml.kernel.org/p/20140619213952.058255809@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function bstr_printf() from lib/vsprintf.c is only available if
CONFIG_BINARY_PRINTF is defined. This is due to the only user currently
being the tracing infrastructure, which needs to select this config
when tracing is configured. Until there is another user of the binary
printf formats, this will continue to be the case.
Since seq_buf.c now lives in lib/ and is compiled even without
tracing, it must guard its use of bstr_printf(), which is used
by seq_buf_printf(). This too is only used by the tracing infrastructure
and is still encapsulated by CONFIG_BINARY_PRINTF.
Link: http://lkml.kernel.org/r/20141104160222.969013383@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add two helper functions, seq_buf_get_buf() and seq_buf_commit(), that
are used by seq_buf_path(). This makes the code similar to the
seq_file: seq_path() function, and will help consolidate
the functions between seq_file and trace_seq.
Link: http://lkml.kernel.org/r/20141104160222.644881406@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.977571447@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently seq_buf is full when all but one byte of the buffer is
filled. Change it so that the seq_buf is full when all of the
buffer is filled.
Some of the functions would fill the buffer completely and report
everything was fine. This was inconsistent with the max of size - 1.
Changing this to be max of size makes all functions consistent.
Link: http://lkml.kernel.org/r/20141104160222.502133196@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.811957882@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a seq_buf_can_fit() helper function that removes possible mistakes
when comparing the seq_buf length plus the added data against the size
of the buffer.
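A helper of this shape can be modelled in userspace as below. The
struct sbuf type and the sbuf_*() names are simplified stand-ins
invented for this sketch, not the kernel's struct seq_buf API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct seq_buf. */
struct sbuf {
	char	*buffer;
	size_t	size;	/* capacity of buffer */
	size_t	len;	/* bytes written so far */
};

/* One place for the bounds comparison, so callers cannot get it wrong. */
static bool sbuf_can_fit(const struct sbuf *s, size_t len)
{
	return s->len <= s->size && len <= s->size - s->len;
}

/* Append len bytes if they fit; return false without writing otherwise. */
static bool sbuf_putmem(struct sbuf *s, const void *mem, size_t len)
{
	if (!sbuf_can_fit(s, len))
		return false;
	memcpy(s->buffer + s->len, mem, len);
	s->len += len;
	return true;
}
```

With the comparison written once, a write that exactly fills the buffer
is accepted and anything larger is rejected.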
Link: http://lkml.kernel.org/r/20141118164025.GL23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To be really paranoid about writing out of bound data in
trace_printk_seq(), add another check of len compared to size.
Link: http://lkml.kernel.org/r/20141119144004.GB2332@dhcp128.suse.cz
Suggested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the seq_buf->len will soon be size + 1 when there's an overflow, we
must use the trace_seq_used() or seq_buf_used() methods to get the real
length. This will prevent buffer overflow issues if just the len
of the seq_buf descriptor is used to copy memory.
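Why the raw len must not be used for copying can be seen in this small
model. The struct and function names are hypothetical; the sketch only
assumes the "len == size + 1 means overflow" convention stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a seq_buf descriptor (buffer pointer omitted). */
struct sbuf {
	size_t	size;	/* capacity of the buffer */
	size_t	len;	/* bytes written, or size + 1 after an overflow */
};

/* Record an overflow using the size + 1 convention. */
static void sbuf_set_overflow(struct sbuf *s)
{
	s->len = s->size + 1;
}

/* Number of valid bytes: never more than the buffer can hold. */
static size_t sbuf_used(const struct sbuf *s)
{
	return s->len < s->size ? s->len : s->size;
}
```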
Link: http://lkml.kernel.org/r/20141114121911.09ba3d38@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing_fill_pipe_page() logic is a little confusing with the
use of count saving the seq.len and reusing it.
Instead of subtracting a number that is calculated from the saved
value of the seq.len from seq.len, just save the seq.len at the start
and if we need to reset it, just assign it again.
When the seq_buf overflow is len == size + 1, the current logic will
break. Changing it to use a saved length for resetting back to the
original value is more robust and will work when we change the way
seq_buf sets the overflow.
Link: http://lkml.kernel.org/r/20141118161546.GJ23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rewrite seq_buf_path() like it is done in seq_path() and allow
it to accept any escape character instead of just "\n".
Making seq_buf_path() like seq_path() will help prevent problems
when converting seq_file to use the seq_buf logic.
Link: http://lkml.kernel.org/r/20141104160222.048795666@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.338523371@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a seq_buf layer that trace_seq sits on. The seq_buf will not
be limited to page size. This will allow other usages of seq_buf
instead of a hard set PAGE_SIZE one that trace_seq has.
Link: http://lkml.kernel.org/r/20141104160221.864997179@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/5468F875.7080907@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- fix NULL pointer dereference:
kernel/bpf/arraymap.c:41 array_map_alloc() error: potential null dereference 'array'. (kzalloc returns null)
kernel/bpf/arraymap.c:41 array_map_alloc() error: we previously assumed 'array' could be null (see line 40)
- integer overflow check was missing in arraymap
(hashmap checks for overflow via kmalloc_array())
- arraymap can round_up(value_size, 8) to zero. check was missing.
- hashmap was missing zero size check as well, since roundup_pow_of_two() can
truncate into zero
- found a typo in the arraymap comment and unnecessary empty line
Fix all of these issues and make both overflow checks explicit U32 in size.
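The two failure modes for the array size (rounding up to zero, and the
multiplication overflowing) can be demonstrated with the sketch below.
The names round_up8() and array_size_ok() are hypothetical, and the real
allocation code may account for additional bookkeeping that is left out
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of 8. In u32 arithmetic this wraps to 0
 * for any x above 0xFFFFFFF8. */
static uint32_t round_up8(uint32_t x)
{
	return (x + 7u) & ~7u;
}

/*
 * Hypothetical validation: compute the total size of the element array,
 * rejecting zero-sized elements and products that do not fit in 32 bits.
 */
static bool array_size_ok(uint32_t value_size, uint32_t max_entries,
			  uint32_t *total)
{
	uint32_t elem_size = round_up8(value_size);

	if (value_size == 0 || max_entries == 0 || elem_size == 0)
		return false;
	if (max_entries > UINT32_MAX / elem_size)
		return false;
	*total = max_entries * elem_size;
	return true;
}
```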
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the trace_seq of ftrace_raw_output_prep() is full this function
returns TRACE_TYPE_PARTIAL_LINE, otherwise it returns zero.
The problem is that TRACE_TYPE_PARTIAL_LINE happens to be zero!
The thing is, the caller of ftrace_raw_output_prep() expects a
success to be zero. Change that to expect it to be
TRACE_TYPE_HANDLED.
Link: http://lkml.kernel.org/r/20141114112522.GA2988@dhcp128.suse.cz
Reminded-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq_printf() and friends are used to store strings into a buffer
that can be passed around from function to function. If the trace_seq buffer
fills up, it will not print any more. The return values were somewhat
inconsistent and using trace_seq_has_overflowed() was a better way to know
if the write to the trace_seq buffer succeeded or not.
Now that all users have stopped reading the return value of the printf()
type functions, they can safely return void and keep future users of them
from reading the inconsistent values as well.
Link: http://lkml.kernel.org/r/20141114011411.992510720@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will not be returning values
soon and will be void functions. To know if they succeeded or not, the
functions trace_seq_has_overflowed() and trace_handle_return() should be
used instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon no longer have
return values. trace_seq_has_overflowed() and trace_handle_return()
should be used instead.
Link: http://lkml.kernel.org/r/20141114011411.693008134@goodmis.org
Link: http://lkml.kernel.org/r/20141115050602.333705855@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon not have a return
value and will only be a void function. Use trace_seq_has_overflowed()
instead to know if the trace_seq operations succeeded or not.
Link: http://lkml.kernel.org/r/20141114011411.530216306@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The return values for trace_seq_printf() and friends are going to be
removed and they will become void functions. The mmio tracer checked
their return and even did so incorrectly.
Some of the functions which returned the values were never checked
themselves. Removing all the checks simplifies the code.
Use trace_seq_has_overflowed() and trace_handle_return() where
necessary instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of checking the return value of trace_seq_printf() and friends
for overflowing of the buffer, use the trace_seq_has_overflowed() helper
function.
This cleans up the code quite a bit and also takes us a step closer to
changing the return values of trace_seq_printf() and friends to void.
Link: http://lkml.kernel.org/r/20141114011411.181812785@goodmis.org
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Doing individual checks all over the place makes the code
very messy. Instead, just check trace_seq_has_overflowed() at the end or
in strategic places.
This makes the code much cleaner and also helps with getting closer
to removing the return values of trace_seq_printf() and friends.
Link: http://lkml.kernel.org/r/20141114011410.987913836@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The branch tracer should not be checking the trace_seq_printf() return value
as that will soon be void. There's a new trace_handle_return() helper function
that will return TRACE_TYPE_PARTIAL_LINE if the trace_seq overflowed
and TRACE_TYPE_HANDLED otherwise.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove checking the return value of all trace_seq_puts(). It was wrong
anyway as only the last return value mattered. But as the trace_seq_puts()
is going to be a void function in the future, we should not be checking
the return value of it anyway.
Just return !trace_seq_has_overflowed() instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Checking the return code of every trace_seq_printf() operation and having
to return early if it overflowed makes the code messy.
Using the new trace_seq_has_overflowed() and trace_handle_return() functions
allows us to clean up the code.
In the future, trace_seq_printf() and friends will be turning into void
functions and not returning a value. The trace_seq_has_overflowed() is to
be used instead. This cleanup allows that change to take place.
Cc: Jens Axboe <axboe@fb.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a trace_seq_has_overflowed() which returns true if the trace_seq
had too much written into it allows us to simplify the code.
Instead of checking the return value of every call to trace_seq_printf()
and friends, they can all be called normally, and at the end we can
return !trace_seq_has_overflowed() instead.
Several functions also return TRACE_TYPE_PARTIAL_LINE when the trace_seq
overflowed and TRACE_TYPE_HANDLED otherwise. Another helper function
was created called trace_handle_return() which takes a trace_seq and
returns these enums. Using this helper function also simplifies the
code.
This change also makes it possible to remove the return values of
trace_seq_printf() and friends. They should instead just be
void functions.
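The "write freely, check once at the end" pattern can be sketched in
userspace as follows. Everything here (struct tseq, the tseq_*()
functions and the LINE_* enum) is a simplified model made up for
illustration; it is not the trace_seq implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum line_status { LINE_PARTIAL = 0, LINE_HANDLED = 1 };

struct tseq {
	char	buf[32];
	size_t	len;
	bool	full;	/* sticky overflow flag */
};

/* printf-style append returning void; an overflowing write is dropped
 * and remembered in the sticky flag. */
static void tseq_printf(struct tseq *s, const char *fmt, ...)
{
	size_t room = sizeof(s->buf) - s->len;
	va_list ap;
	int n;

	if (s->full)
		return;
	va_start(ap, fmt);
	n = vsnprintf(s->buf + s->len, room, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= room) {
		s->buf[s->len] = '\0';	/* discard the partial write */
		s->full = true;
		return;
	}
	s->len += (size_t)n;
}

static bool tseq_has_overflowed(const struct tseq *s)
{
	return s->full;
}

/* Map the overflow state to the status a print handler returns. */
static enum line_status tseq_handle_return(const struct tseq *s)
{
	return tseq_has_overflowed(s) ? LINE_PARTIAL : LINE_HANDLED;
}
```

A caller issues any number of tseq_printf() calls and finishes with a
single "return tseq_handle_return(s);" instead of testing every call.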
Link: http://lkml.kernel.org/r/20141114011410.365183157@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_seq_bitmask() calls bitmap_scnprintf() not from the current
position of the trace_seq buffer (s->buffer + s->len), but instead from
the beginning of the buffer (s->buffer).
Luckily, the only user of this, the "ipi_raise" tracepoint, uses it as the
first parameter, and as such it lands at the start of the temp buffer in
include/trace/ftrace.h (see __get_bitmask()).
Reported-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Stack traces that happen from function tracing check if the address
on the stack is a __kernel_text_address(). That is, is the address
kernel code. This calls core_kernel_text() which returns true
if the address is part of the builtin kernel code. It also calls
is_module_text_address() which returns true if the address belongs
to module code.
But what is missing is ftrace dynamically allocated trampolines.
These trampolines are allocated for individual ftrace_ops that
call the ftrace_ops callback functions directly. But if they do a
stack trace, the code checking the stack won't detect them as they
are neither core kernel code nor module address space.
By adding another field to ftrace_ops that also stores the size of
the trampoline assigned to it, we can create a new function called
is_ftrace_trampoline() that returns true if the address is a
dynamically allocated ftrace trampoline. Note, it ignores trampolines
that are not dynamically allocated as they will return true with
the core_kernel_text() function.
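An address-in-trampoline test built on a start address plus a size can
be sketched like this. The struct tramp_ops type and both function
names are hypothetical, and treating a size of 0 as "not dynamically
allocated" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two ftrace_ops fields discussed above. */
struct tramp_ops {
	uintptr_t	trampoline;		/* start address, 0 if none */
	size_t		trampoline_size;	/* assumed 0 if not dynamic */
};

static bool addr_in_trampoline(const struct tramp_ops *op, uintptr_t addr)
{
	return op->trampoline && op->trampoline_size &&
	       addr >= op->trampoline &&
	       addr - op->trampoline < op->trampoline_size;
}

/* True if addr lies inside any dynamically allocated trampoline. */
static bool is_dyn_trampoline(const struct tramp_ops *ops, size_t n,
			      uintptr_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr_in_trampoline(&ops[i], addr))
			return true;
	return false;
}
```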
Link: http://lkml.kernel.org/r/20141119034829.497125839@goodmis.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.
[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add the new __NR_s390_pci_mmio_write and __NR_s390_pci_mmio_read
system calls to allow user space applications to access device PCI I/O
memory pages on s390x platform.
[ Martin Schwidefsky: some code beautification ]
Signed-off-by: Alexey Ishchuk <aishchuk@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The function probe counting for traceon and traceoff suffered a race
condition where if the probe was executing on two or more CPUs at the
same time, it could decrement the counter by more than one when
disabling (or enabling) the tracer only once.
The way the traceon and traceoff probes are supposed to work is that
they disable (or enable) tracing once per count. If a user were to
echo 'schedule:traceoff:3' into set_ftrace_filter, then when the
schedule function was called, it would disable tracing. But the count
should only be decremented once (to 2). Then if the user enabled tracing
again (via tracing_on file), the next call to schedule would disable
tracing again and the count would be decremented to 1.
But if multiple CPUs called schedule at the same time, it is possible
that the count would be decremented more than once because of the
simple "count--" used.
By reading the count into a local variable and using memory barriers
we can guarantee that the count would only be decremented once per
disable (or enable).
The stack trace probe had a similar race, but here the stack trace will
decrement for each time it is called. But this had the read-modify-
write race, where it could stack trace more than the number of times
that was specified. In this case we use a cmpxchg to stack trace only the
number of times specified.
The dump probes can still use the old "update_count()" function as
they only run once, and that is controlled by the dump logic
itself.
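The compare-and-exchange approach for the stack trace count can be
illustrated with C11 atomics in userspace. This is a sketch of the
general technique only; consume_count() is a hypothetical name and the
kernel uses its own cmpxchg() primitive rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to consume one unit of the remaining budget. Returns true if this
 * caller won a slot and false once the count has reached zero. A plain
 * "count--" could let two racing callers both act on the same value; the
 * compare-exchange only succeeds if nobody changed the count in between.
 */
static bool consume_count(atomic_long *count)
{
	long old = atomic_load(count);

	while (old > 0) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return true;
		/* old now holds the freshly observed value; retry. */
	}
	return false;
}
```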
Link: http://lkml.kernel.org/r/20141118134643.4b550ee4@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The number of and dependencies between high-level power management
Kconfig options make life much harder than necessary. Several
combinations of them have to be tested and supported, even though
some of those combinations are very rarely used in practice (if
they are used in practice at all). Moreover, the fact that we
have separate independent Kconfig options for runtime PM and
system suspend is a serious obstacle for integration between
the two frameworks.
To overcome these difficulties, always select PM_RUNTIME if PM_SLEEP
is set. Among other things, this will allow system suspend callbacks
provided by bus types and device drivers to rely on the runtime PM
framework regardless of the kernel configuration.
Enthusiastically-acked-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
proper types and function helpers are ready. Use them in verifier testsuite.
Remove temporary stubs
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix errno of BPF_MAP_LOOKUP_ELEM command as bpf manpage
described it in commit b4fc1a460f30("Merge branch 'bpf-next'"):
-----
BPF_MAP_LOOKUP_ELEM
int bpf_lookup_elem(int fd, void *key, void *value)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
};
return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
bpf() syscall looks up an element with given key in a map fd.
If element is found it returns zero and stores element's value
into value. If element is not found it returns -1 and sets
errno to ENOENT.
and further down in manpage:
ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that
element with given key was not found.
-----
In general all BPF commands return ENOENT when map element is not found
(including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with
flags == BPF_MAP_UPDATE_ONLY)
Subsequent patch adds a testsuite to check return values for all of
these combinations.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type BPF_MAP_TYPE_ARRAY and its implementation
- optimized for fastest possible lookup()
  . in the future verifier/JIT may recognize lookup() with constant key
    and optimize it into constant pointer. Can optimize non-constant
    key into direct pointer arithmetic as well, since pointers and
    value_size are constant for the life of the eBPF program.
    In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT
    while preserving concurrent access to this map from user space
- two main use cases for array type:
  . 'global' eBPF variables: array of 1 element with key=0 and value is a
    collection of 'global' variables which programs can use to keep the state
    between events
  . aggregation of tracing events into fixed set of buckets
- all array elements pre-allocated and zero initialized at init time
- key is an index into the array and can only be 4 bytes
- map_delete_elem() returns EINVAL, since elements cannot be deleted
- map_update_elem() replaces elements in a non-atomic way
  (for atomic updates hashtable type should be used instead)
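The array-map properties listed above can be modeled in plain userspace C. This is a minimal sketch under stated assumptions, not the kernel implementation: `toy_array_map` and the `toy_array_*` helpers are hypothetical names, and the error code returned for an out-of-range update is an assumption of the sketch. Only the behaviors named in this changelog are modeled: pre-allocated zeroed elements, a 4-byte index key, non-atomic replace, and delete returning EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of an array-type map; not kernel code. */
struct toy_array_map {
	uint32_t value_size;
	uint32_t max_entries;
	unsigned char *values;	/* max_entries * value_size bytes */
};

/* All elements are allocated and zero initialized up front. */
static struct toy_array_map *toy_array_alloc(uint32_t value_size,
					     uint32_t max_entries)
{
	struct toy_array_map *map;

	if (value_size == 0 || max_entries == 0)
		return NULL;
	map = malloc(sizeof(*map));
	if (!map)
		return NULL;
	map->values = calloc(max_entries, value_size);
	if (!map->values) {
		free(map);
		return NULL;
	}
	map->value_size = value_size;
	map->max_entries = max_entries;
	return map;
}

/* The key is a 4-byte index; lookup is plain pointer arithmetic. */
static void *toy_array_lookup(struct toy_array_map *map, uint32_t index)
{
	assert(map != NULL);
	if (index >= map->max_entries)
		return NULL;
	return map->values + (size_t)index * map->value_size;
}

/* Non-atomic replace: a concurrent reader could observe a partial copy. */
static int toy_array_update(struct toy_array_map *map, uint32_t index,
			    const void *value)
{
	void *slot = toy_array_lookup(map, index);

	if (!slot)
		return -E2BIG;	/* error code assumed for this sketch */
	memcpy(slot, value, map->value_size);
	return 0;
}

/* Elements cannot be deleted from an array map. */
static int toy_array_delete(struct toy_array_map *map, uint32_t index)
{
	(void)map;
	(void)index;
	return -EINVAL;
}

static void toy_array_free(struct toy_array_map *map)
{
	if (map) {
		free(map->values);
		free(map);
	}
}
```

The 'global variables' use case corresponds to `toy_array_alloc(sizeof(struct my_globals), 1)` followed by lookups with index 0, where `struct my_globals` stands for whatever state the program wants to keep between events.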
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type BPF_MAP_TYPE_HASH and its implementation
- maps are created/destroyed by userspace. Both userspace and eBPF programs
  can lookup/update/delete elements from the map
- eBPF programs can be called in_irq(), so use spin_lock_irqsave() mechanism
  for concurrent updates
- key/value are opaque range of bytes (aligned to 8 bytes)
- user space provides 3 configuration attributes via BPF syscall:
  key_size, value_size, max_entries
- map takes care of allocating/freeing key/value pairs
- map_update_elem() must fail to insert new element when max_entries
  limit is reached to make sure that eBPF programs cannot exhaust memory
- map_update_elem() replaces elements in an atomic way
- optimized for speed of lookup() which can be called multiple times from
  eBPF program which itself is triggered by high volume of events
  . in the future JIT compiler may recognize lookup() call and optimize it
    further, since key_size is constant for life of eBPF program
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
either update existing map element or create a new one.
Initially the plan was to add a new command to handle the case of
'create new element if it didn't exist', but 'flags' style looks
cleaner and overall diff is much smaller (more code reused), so add 'flags'
attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
#define BPF_ANY 0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element if it didn't exist */
#define BPF_EXIST 2 /* update existing element */
bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
if element already exists.
bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT
if element doesn't exist.
Userspace will call it as:
int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
{
	union bpf_attr attr = {
		.map_fd = fd,
		.key = ptr_to_u64(key),
		.value = ptr_to_u64(value),
		.flags = flags,
	};
	return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
First two bits of 'flags' are used to encode style of bpf_update_elem() command.
Bits 2-63 are reserved for future use.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement cgroup_get_e_css() which finds and gets the effective css
for the specified cgroup and subsystem combination. This function
always returns a valid pinned css. This will be used by cgroup
writeback support.
While at it, add comment to cgroup_e_css() to explain why that
function is different from cgroup_get_e_css() and has to test
cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss).
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Add a new cgroup_subsys operation ->css_e_css_changed(). This is
invoked if any of the effective csses seen from the css's cgroup may
have changed. This will be used to implement cgroup writeback
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Add a new cgroup subsys callback css_released(). This is called when
the reference count of the css (cgroup_subsys_state) reaches zero
before RCU scheduling free.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
When a subsystem is offlined, its entry on @cgrp->subsys[] is cleared
asynchronously. If cgroup_subtree_control_write() is requested to
enable the subsystem again before the entry is cleared, it has to wait
for the previous offlining to finish and clear the @cgrp->subsys[]
entry before trying to enable the subsystem again.
This is currently done while verifying the input enable / disable
parameters. This used to be correct but f63070d350 ("cgroup: make
interface files visible iff enabled on cgroup->subtree_control")
breaks it. The commit is one of the commits implementing subsystem
dependency.
Through subsystem dependency, some subsystems may be enabled and
disabled implicitly in addition to the explicitly requested ones. The
actual subsystems to be enabled and disabled are determined during
@css_enable/disable calculation. The current offline wait logic skips
the ones which are already implicitly enabled and then waits for
subsystems in @enable; however, this misses the subsystems which may
be implicitly enabled through dependency from @enable. If such an
implicitly enabled subsystem hasn't finished offlining yet, the function
ends up trying to create a css when its @cgrp->subsys[] slot is
already occupied, triggering BUG_ON() in init_and_link_css().
Fix it by moving the wait logic after @css_enable is calculated and
waiting for all the subsystems in @css_enable. This fixes the above
bug as the mask contains all subsystems which are to be enabled
including the ones enabled through dependencies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f63070d350 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control")
Acked-by: Zefan Li <lizefan@huawei.com>
Make cgroup_subtree_control_write() first calculate new
subtree_control (new_sc), child_subsys_mask (new_ss) and
css_enable/disable masks before applying them to the cgroup. Also,
store the original subtree_control (old_sc) and child_subsys_mask
(old_ss) and use them to restore the original state after failure.
This patch shouldn't cause any behavior changes. This prepares for a
fix for a bug in the async css offline wait logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup_refresh_child_subsys_mask() calculates and updates the
effective @cgrp->child_subsys_mask according to the current
@cgrp->subtree_control. Separate out the calculation part into
cgroup_calc_child_subsys_mask(). This will be used to fix a bug in
the async css offline wait logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
If BL_SWITCHER is enabled but SUSPEND and CPU_IDLE are not enabled,
we get the following config warning:
warning: (BL_SWITCHER) selects CPU_PM which has unmet direct
dependencies (SUSPEND || CPU_IDLE)
It has been noticed that CPU_PM dependencies in this file are not really
required so let's remove these dependencies from CPU_PM.
Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The vfree() function also performs input parameter validation, so the test
around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
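A minimal sketch of how userspace might issue the two prctl() commands described above. The wrapper name `mpx_set_management()` is hypothetical, the fallback values 43 and 44 are the numbers assigned to these commands in the kernel's uapi prctl.h, and passing zero for the unused arguments is an assumption of this sketch. Setting up the bounds directory and the bndcfgu register beforehand is the application's job and is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Defined here only if the libc headers in use do not provide them. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT 43
#endif
#ifndef PR_MPX_DISABLE_MANAGEMENT
#define PR_MPX_DISABLE_MANAGEMENT 44
#endif

/*
 * Ask the kernel to start (enable != 0) or stop managing bounds tables.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mpx_set_management(int enable)
{
	int option = enable ? PR_MPX_ENABLE_MANAGEMENT
			    : PR_MPX_DISABLE_MANAGEMENT;

	if (prctl(option, 0, 0, 0, 0) == -1) {
		assert(errno > 0);
		return -errno;
	}
	return 0;
}
```

On a kernel or CPU without MPX support the call is expected to fail, so callers should treat a negative return as "no kernel-managed bounds tables" rather than as a fatal error.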
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
#BR exceptions are similar conceptually to a page fault and will be raised by
the MPX hardware both on bounds violations and when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal process's
address space (essentially calling the new mmap() interface introduced
earlier in this patch set) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-populated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
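The address-space arithmetic from the first question above (4*X for the tables plus 2GB for the bounds directory) can be checked with a few lines of C. This is only a worked calculation using the ratios quoted in this changelog; `mpx_reserve_bytes()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/*
 * Virtual space needed to pre-reserve bounds tables for 'covered' bytes of
 * address space, using the ratio quoted in the changelog: four bytes of
 * bounds table per byte covered, plus a 2GB bounds directory.
 */
static uint64_t mpx_reserve_bytes(uint64_t covered)
{
	const uint64_t directory = 2 * GIB;

	assert(covered <= (UINT64_MAX - directory) / 4);	/* no overflow */
	return 4 * covered + directory;
}
```

For the 128TB user address space, `mpx_reserve_bytes(128 * TIB)` gives 512TB + 2GB, which is more than the space it is supposed to cover and therefore cannot be reserved up front.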
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds new fields about bound violation into siginfo
structure. si_lower and si_upper are respectively lower bound
and upper bound when bound violation is caused.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151819.1908C900@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The version field defined in the audit status structure was found to have
limitations in terms of its expressibility of features supported. This is
distinct from the get/set features call to be able to command those features
that are present.
Converting this field from a version number to a feature bitmap will allow
distributions to selectively backport and support certain features and will
allow upstream to be able to deprecate features in the future. It will allow
userspace clients to first query the kernel for which features are actually
present and supported. Currently, EINVAL is returned rather than EOPNOTSUPP,
which isn't helpful in determining if there was an error in the command, or if
it simply isn't supported yet. Past features are not represented by this
bitmap, but their use may be converted to EOPNOTSUPP if needed in the future.
Since "version" is too generic to convert with a #define, use a union in the
struct status, introducing the member "feature_bitmap" unionized with
"version".
Convert existing AUDIT_VERSION_* macros over to AUDIT_FEATURE_BITMAP*
counterparts, leaving the former for backwards compatibility.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: minor whitespace tweaks]
Signed-off-by: Paul Moore <pmoore@redhat.com>
This patch reorders fields in the perf_sample_data struct in order to
minimize the number of cachelines touched in perf_sample_data_init().
It also removes some initializations which are redundant with the code
in kernel/events/core.c.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1411559322-16548-7-git-send-email-eranian@google.com
Cc: cebbert.lkml@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: jolsa@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable capture of interrupted machine state for each sample.
Registers to sample are passed per event in the sample_regs_intr bitmask.
To sample interrupt machine state, the PERF_SAMPLE_REGS_INTR flag must be set in
sample_type.
The list of available registers is arch dependent and provided by asm/perf_regs.h.
Registers are laid out as u64 values in the bit order of sample_regs_intr.
This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend
the perf_event_attr struct with a new u64 field.
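Because the sampled registers are stored as consecutive u64 values in the bit order of the mask, a consumer can locate a given register by counting the mask bits below it. The helper below is a hypothetical sketch (`sampled_reg_index()` is not a perf API) and assumes ascending bit order; it relies on the GCC/Clang builtin `__builtin_popcountll`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Find the position of register 'bit' inside the sampled register array.
 * The position equals the number of mask bits set below 'bit'.
 * Returns -1 if the register was not requested in the mask.
 */
static int sampled_reg_index(uint64_t mask, unsigned int bit)
{
	assert(bit < 64);
	if (!(mask & (1ULL << bit)))
		return -1;
	/* Count the requested registers that precede this one. */
	return __builtin_popcountll(mask & ((1ULL << bit) - 1));
}
```

With a mask selecting bits 0, 3 and 7, the value for bit 3 is the second u64 in the sample (index 1) and the value for bit 7 is the third (index 2).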
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: cebbert.lkml@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Actually, cpudl_set() and cpudl_init() can never be used without
CONFIG_SMP.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415260327-30465-4-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Actually, cpupri_set() and cpupri_init() can never be used without
CONFIG_SMP.
Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do not call dequeue_pushable_dl_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.
This patch has the same behavior as commit 311e800e16 ("sched,
rt: Fix rq->rt.pushable_tasks bug in push_rt_task()") on the -rt side.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded > imbalance > other, and busiest group is picked according
to this order.
sgs->group_capacity_factor is used to check if the group is overloaded.
When the child domain prefers tasks to go to siblings first, the
sgs->group_capacity_factor will be set lower than one in order to
move all the excess tasks away.
However, group overloaded status is not updated when
sgs->group_capacity_factor is set to lower than one, so we fail
to find the busiest group.
This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.
Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As discussed in [1], accounting IO is meant for blkio only. Document that
so driver authors won't use them for device io.
[1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit d670ec1317 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case at the cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
in Y being smaller than X.
The reproducer/tester can be found further below; it can be compiled and run with:
gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
while ./tst-cpuclock2 ; do : ; done
This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".
The issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads' runtime not yet accounted and
then add the threads' runtime to the running cputimer again on scheduler
tick, making its sum_exec_runtime bigger than the actual threads' runtime.
KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .
This patch takes a different approach to cure the problem. It calls
update_curr() when the cputimer starts, which ensures that we have updated
stats of running threads, so that on the next scheduler tick we account
only the runtime that elapsed since the cputimer start. That also ensures we
have a consistent state between the cpu times of individual threads and the
cpu time of the process consisting of those threads.
Full reproducer (tst-cpuclock2.c):
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>
/* Parameters for the Linux kernel ABI for CPU clocks. */
#define CPUCLOCK_SCHED 2
#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
	((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
static pthread_barrier_t barrier;
/* Help advance the clock. */
static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1) ;
	return NULL;
}
/* Don't use the glibc wrapper. */
static int do_nanosleep(int flags, const struct timespec *req)
{
	clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
	return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
}
static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
{
	int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
	int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
	return after_i - before_i;
}
int main(void)
{
	int result = 0;
	pthread_t th;
	pthread_barrier_init(&barrier, NULL, 2);
	if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}
	pthread_barrier_wait(&barrier);
	/* The test. */
	struct timespec before, after, sleeptimeabs;
	int64_t sleepdiff, diffabs;
	const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
	/* The relative nanosleep. Not sure why this is needed, but its presence
	   seems to make it easier to reproduce the problem. */
	if (do_nanosleep(0, &sleeptime) != 0) {
		perror("clock_nanosleep");
		return 1;
	}
	/* Get the current time. */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
		perror("clock_gettime[2]");
		return 1;
	}
	/* Compute the absolute sleep time based on the current time. */
	uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
	sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
	sleeptimeabs.tv_nsec = nsec % 1000000000;
	/* Sleep for the computed time. */
	if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
		perror("absolute clock_nanosleep");
		return 1;
	}
	/* Get the time after the sleep. */
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
		perror("clock_gettime[3]");
		return 1;
	}
	/* The time after sleep should always be equal to or after the absolute sleep
	   time passed to clock_nanosleep. */
	sleepdiff = tsdiff(&sleeptimeabs, &after);
	if (sleepdiff < 0) {
		printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
		result = 1;
		/* Cast explicitly: time_t and long do not match %llu on all ABIs. */
		printf("Before %llu.%09llu\n", (unsigned long long) before.tv_sec,
		       (unsigned long long) before.tv_nsec);
		printf("After %llu.%09llu\n", (unsigned long long) after.tv_sec,
		       (unsigned long long) after.tv_nsec);
		printf("Sleep %llu.%09llu\n", (unsigned long long) sleeptimeabs.tv_sec,
		       (unsigned long long) sleeptimeabs.tv_nsec);
	}
	/* The difference between the timestamps taken before and after the
	   clock_nanosleep call should be equal to or more than the duration of the
	   sleep. */
	diffabs = tsdiff(&before, &after);
	if (diffabs < sleeptime.tv_nsec) {
		printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
		result = 1;
	}
	pthread_cancel(th);
	return result;
}
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While looking over the cpu-timer code I found that we appear to add
the delta for the calling task twice, through:
  cpu_timer_sample_group()
    thread_group_cputimer()
      thread_group_cputime()
        times->sum_exec_runtime += task_sched_runtime();
    *sample = cputime.sum_exec_runtime + task_delta_exec();
Which would make the sample run ahead, making the sleep short.
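The effect of adding the delta twice can be shown with a toy calculation. This is an illustrative model only, with hypothetical helper names: `accounted` stands for runtime already folded into the thread's stats and `delta` for the not-yet-accounted runtime of the running task.

```c
#include <assert.h>
#include <stdint.h>

/* The group sum already includes the pending delta of the running task. */
static uint64_t toy_group_sum(uint64_t accounted, uint64_t delta)
{
	assert(accounted <= UINT64_MAX - delta);	/* no overflow */
	return accounted + delta;
}

/* Mirrors the call chain above: the same delta is added a second time. */
static uint64_t toy_sample_double_counted(uint64_t accounted, uint64_t delta)
{
	return toy_group_sum(accounted, delta) + delta;
}

/* Counting the delta once yields the CPU time actually consumed. */
static uint64_t toy_sample_counted_once(uint64_t accounted, uint64_t delta)
{
	return toy_group_sum(accounted, delta);
}
```

With 1000ns accounted and a 50ns pending delta, the double-counted sample reads 1100ns instead of 1050ns, so a sleep that waits on this clock ends 50ns too early.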
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.
Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a CPU is hotplugged out, we call perf_remove_from_context() (via
perf_event_exit_cpu()) to rip each CPU-bound event out of its PMU's cpu
context, but leave siblings grouped together. Freeing of these events is
left to the mercy of the usual refcounting.
When a CPU-bound event's refcount drops to zero we cross-call to
__perf_remove_from_context() to clean it up, detaching grouped siblings.
This works when the relevant CPU is online, but will fail if the CPU is
currently offline, and we won't detach the event from its siblings
before freeing the event, leaving the sibling list corrupt. If the
sibling list is later walked (e.g. because the CPU came online again
before a remaining sibling's refcount drops to zero), we will walk the
now corrupted siblings list, potentially dereferencing garbage values.
Given that the events should never be scheduled again (as we removed
them from their context), we can simply detach siblings when the CPU
goes down in the first place. If the CPU comes back online, the
redundant call to __perf_remove_from_context() is safe.
Reported-by: Drew Richardson <drew.richardson@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: vincent.weaver@maine.edu
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415203904-25308-2-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The IA64_HP_SIM dependency on PM_RUNTIME should be done in the arch
Kconfig instead of in the PM core. Move it accordingly.
NOTE: arch/ia64/Kconfig currently does a 'select PM', which since
commit 1eb208aea3 (PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP ||
CONFIG_PM_RUNTIME)) is effectively a noop unless PM_SLEEP or
PM_RUNTIME are set elsewhere.
Signed-off-by: Kevin Hilman <khilman@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are three regression fixes, two recent (generic power domains,
suspend-to-idle) and one older (cpufreq), an ACPI blacklist entry for
one more machine having problems with Windows 8 compatibility, a minor
cpufreq driver fix (cpufreq-dt) and a fixup for new callback
definitions (generic power domains).
Specifics:
- Fix a crash in the suspend-to-idle code path introduced by a recent
commit that forgot to check a pointer against NULL before
dereferencing it (Dmitry Eremin-Solenikov).
- Fix a boot crash on Exynos5 introduced by a recent commit making
that platform use generic Device Tree bindings for power domains
which exposed a weakness in the generic power domains framework
leading to that crash (Ulf Hansson).
- Fix a crash during system resume on systems where cpufreq depends
on Operation Performance Points (OPP) for functionality, but
CONFIG_OPP is not set. This leads the cpufreq driver registration
to fail, but the resume code attempts to restore the pre-suspend
cpufreq configuration (which does not exist) nevertheless and
crashes. From Geert Uytterhoeven.
- Add a new ACPI blacklist entry for Dell Vostro 3546 that has
problems if it is reported as Windows 8 compatible to the BIOS
(Adam Lee).
- Fix swapped arguments in an error message in the cpufreq-dt driver
(Abhilash Kesavan).
- Fix up the prototypes of new callbacks in struct generic_pm_domain
to make them more useful. Users of those callbacks will be added
in 3.19 and it's better for them to be based on the correct struct
definition in mainline from the start. From Ulf Hansson and Kevin
Hilman"
* tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / Domains: Fix initial default state of the need_restore flag
PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set
PM / Domains: Change prototype for the attach and detach callbacks
cpufreq: Avoid crash in resume on SMP without OPP
cpufreq: cpufreq-dt: Fix arguments in clock failure error message
ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546
By convention, "msecs" means milliseconds and "usecs" means
microseconds. Since the unit used in the code is microseconds,
the notation should be changed from msecs to usecs.
Link: http://lkml.kernel.org/r/1415171926-9782-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On the function_graph tracer, the print_graph_irq() function prints a
trace line with the flag ==========> on an irq handler entry, and the
flag <========== on an irq handler return.
But when latency-format is enabled, it does not print the
latency-format flags, causing the following error in the trace output:
0) ==========> |
0) d... | smp_apic_timer_interrupt() {
This patch fixes this issue by printing the latency-format flags when
latency-format is enabled.
Link: http://lkml.kernel.org/r/7c2e226dac20c940b6242178fab7f0e3c9b5ce58.1415233316.git.bristot@redhat.com
Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Printing a single character to a seqfile might as well be done with
seq_putc instead of seq_puts; this avoids a strlen() call and a memory
access. It also shaves another few bytes off the generated code.
Link: http://lkml.kernel.org/r/1415479332-25944-4-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
drivers/net/ethernet/chelsio/cxgb4vf/sge.c
drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c
sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.
ixgbe_phy.c was a set of overlapping whitespace changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Consecutive seq_puts calls with literal strings can be merged to a
single call. This reduces the size of the generated code, and can also
lead to slight .rodata reduction (because of fewer nul and padding
bytes). It should also shave off a few clock cycles.
Link: http://lkml.kernel.org/r/1415479332-25944-3-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using seq_printf to print a simple string or a single character is a
lot more expensive than it needs to be, since seq_puts and seq_putc
exist.
These patches do
seq_printf(m, s) -> seq_puts(m, s)
seq_printf(m, "%s", s) -> seq_puts(m, s)
seq_printf(m, "%c", c) -> seq_putc(m, c)
Subsequent patches will simplify further.
Link: http://lkml.kernel.org/r/1415479332-25944-2-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kdb's ftdump command will livelock by constantly printk'ing
the empty string at KERN_EMERG level if it is run when the ftrace system is
not in use. This occurs because trace_empty() never returns false when
the ring buffers are left at the start of a non-consuming read [launched
by ring_buffer_read_start()].
This patch changes the loop exit condition to use the result of
trace_find_next_entry_inc(). Effectively this switches the non-consuming
kdb dumper to follow the approach of the non-consuming userspace
interface [s_next()] rather than the consuming ftrace_dump().
Link: http://lkml.kernel.org/r/1415277716-19419-3-git-send-email-daniel.thompson@linaro.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kdb's ftdump command unconditionally crashes due to a null
pointer de-reference whenever the command is run. This in turn causes
the kernel to panic.
The abridged stacktrace (gathered with ARCH=arm) is:
--- cut here ---
[<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440)
[<c02132dc>] (die) from [<c0952eb8>]
(__do_kernel_fault.part.11+0x74/0x84)
[<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>]
(do_page_fault+0x1d0/0x3c4)
[<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac)
[<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60)
Exception stack(0xc0deba88 to 0xc0debad0)
ba80: e8c29180 00000001 e9854304 e9854300 c0f567d8
c0df2580
baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000
c0debad0
bac0: 0000672e c02f4d60 60000193 ffffffff
[<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8)
[<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698)
[<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784)
[<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490)
--- cut here ---
The NULL deref occurs due to the uninitialized use of struct trace_iter's
buffer_iter member.
This is a regression, albeit a fairly elderly one. It was introduced
by commit 6d158a813e ("tracing: Remove NR_CPUS array from
trace_iterator").
This patch solves this by providing a collection of ring_buffer_iter(s)
and using this to initialize buffer_iter. Note that static allocation
is used solely because the trace_iter itself is also statically allocated.
Static allocation also means that we have to NULL-ify the pointer during
cleanup to avoid use-after-free problems.
Link: http://lkml.kernel.org/r/1415277716-19419-2-git-send-email-daniel.thompson@linaro.org
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to the documentation, adding "traceoff_on_warning" to the boot
command line should be enough to enable the feature. But right now it is
necessary to specify "traceoff_on_warning=". Along with fixing that, also
verify if the value passed, if any, is either "0" or "off".
Link: http://lkml.kernel.org/r/20141112231400.GL12281@uudg.org
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new logic, if only a single user of ftrace function hooks is
used, it will get its own trampoline assigned to it.
The problem is that the control_ops is an indirect ops that perf ops
uses. What that means is that when perf registers its ops with
register_ftrace_function(), it has the CONTROL flag set and gets added
to the control list instead of the global ftrace list. The control_ops
gets added to that instead and the mcount trampoline calls the control_ops
function. The control_ops function will iterate the control list and
call the ops functions that are attached to it.
But currently the trampoline is added to the perf ops and not the
control ops, and when ftrace tries to find a trampoline hook for it,
it fails to find one and gives the following splat:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 10133 at kernel/trace/ftrace.c:2033 ftrace_get_addr_new+0x6f/0xc0()
Modules linked in: [...]
CPU: 0 PID: 10133 Comm: perf Tainted: P 3.18.0-rc1-test+ #388
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
00000000000007f1 ffff8800c2643bc8 ffffffff814fca6e ffff88011ea0ed01
0000000000000000 ffff8800c2643c08 ffffffff81041ffd 0000000000000000
ffffffff810c388c ffffffff81a5a350 ffff880119b00000 ffffffff810001c8
Call Trace:
[<ffffffff814fca6e>] dump_stack+0x46/0x58
[<ffffffff81041ffd>] warn_slowpath_common+0x81/0x9b
[<ffffffff810c388c>] ? ftrace_get_addr_new+0x6f/0xc0
[<ffffffff810001c8>] ? 0xffffffff810001c8
[<ffffffff81042031>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810c388c>] ftrace_get_addr_new+0x6f/0xc0
[<ffffffff8102e938>] ftrace_replace_code+0xd6/0x334
[<ffffffff810c4116>] ftrace_modify_all_code+0x41/0xc5
[<ffffffff8102eba6>] arch_ftrace_update_code+0x10/0x19
[<ffffffff810c293c>] ftrace_run_update_code+0x21/0x42
[<ffffffff810c298f>] ftrace_startup_enable+0x32/0x34
[<ffffffff810c3049>] ftrace_startup+0x14e/0x15a
[<ffffffff810c307c>] register_ftrace_function+0x27/0x40
[<ffffffff810dc118>] perf_ftrace_event_register+0x3e/0xee
[<ffffffff810dbfbe>] perf_trace_init+0x29d/0x2a9
[<ffffffff810eb422>] perf_tp_event_init+0x27/0x3a
[<ffffffff810f18bc>] perf_init_event+0x9e/0xed
[<ffffffff810f1ba4>] perf_event_alloc+0x299/0x330
[<ffffffff810f236b>] SYSC_perf_event_open+0x3ee/0x816
[<ffffffff8115a066>] ? mntput+0x2d/0x2f
[<ffffffff81142b00>] ? __fput+0xa7/0x1b2
[<ffffffff81091300>] ? do_gettimeofday+0x22/0x3a
[<ffffffff810f279c>] SyS_perf_event_open+0x9/0xb
[<ffffffff81502a92>] system_call_fastpath+0x12/0x17
---[ end trace 81a53565150e4982 ]---
Bad trampoline accounting at: ffffffff810001c8 (run_init_process+0x0/0x2d) (10000001)
Update the control_ops trampoline instead of the perf ops one.
Reported-by: lkp@01.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 69361eef90 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
but failed to update the comments for print_tainted(). So, update the
comments.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit affines rcu_tasks_kthread() to the housekeeping CPUs
in CONFIG_NO_HZ_FULL builds. This is just a default, so systems
administrators are free to put this kthread somewhere else if they wish.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull audit fixes from Paul Moore:
"After he sent the initial audit pull request for 3.18, Eric asked me
to take over the management of the audit tree, hence this pull request
to fix a couple of problems with audit.
As you can see below, the changes are minimal: adding some whitespace
to a string so userspace parses it correctly, and fixing a problem
with audit's usage of fsnotify that was causing audit watch rules to
be lost. Neither of these patches was very controversial on the
mailing lists and they fix real problems, getting them into 3.18 would
be a good thing"
* 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit:
audit: keep inode pinned
audit: AUDIT_FEATURE_CHANGE message format missing delimiting space
Audit rules disappear when an inode they watch is evicted from the cache.
This is likely not what we want.
The guilty commit is "fsnotify: allow marks to not pin inodes in core",
which didn't take into account that audit_tree adds watches with a zero
mask.
Adding any mask should fix this.
Fixes: 90b1e7a578 ("fsnotify: allow marks to not pin inodes in core")
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org # 2.6.36+
Signed-off-by: Paul Moore <pmoore@redhat.com>
The only code that references tracing_sched_switch_trace() and
tracing_sched_wakeup_trace() is the wakeup latency tracer. Those
two functions used to belong to the sched_switch tracer which has
long been removed. These functions were left behind because the
wakeup latency tracer used them. But since the wakeup latency tracer
is the only one to use them, they should be static functions inside
that code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the previous patch it is clear that "tracer_enabled" can never be
true, so we can remove the "if (tracer_enabled)" code in
probe_sched_switch() and probe_sched_wakeup(). We can also remove
tracer_enabled, ctx_trace, and sched_stopped.
Link: http://lkml.kernel.org/p/20140723193503.GA30217@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_{start,stop}_sched_switch_record() have no callers since
87d80de280 "tracing: Remove obsolete sched_switch tracer".
The last caller of tracing_sched_switch_assign_trace() was removed
by 30dbb20e68 "tracing: Remove boot tracer".
Link: http://lkml.kernel.org/p/20140723193501.GA30214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the introduction of the dynamic trampolines, it is useful for
ftrace_bug() to produce more information about the current state when
things go wrong. This can help debug issues that may arise.
Ftrace has lots of checks to make sure that the state of the system it
touches is exactly what it expects it to be. When it detects an abnormality
it calls ftrace_bug() and disables itself to prevent any further damage.
It is crucial that ftrace_bug() produces sufficient information that
can be used to debug the situation.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the static ftrace_ops (like function tracer) enables tracing, and it
is the only callback that is referencing a function, a trampoline is
dynamically allocated to the function that calls the callback directly
instead of calling a loop function that iterates over all the registered
ftrace ops (if more than one ops is registered).
But when it comes to dynamically allocated ftrace_ops, where they may be
freed, on a CONFIG_PREEMPT kernel there's no way to know when it is safe
to free the trampoline. If a task was preempted while executing on the
trampoline, there's currently no way to know when it will be off that
trampoline.
But this is not true when it comes to !CONFIG_PREEMPT. The current method
of calling schedule_on_each_cpu() will force tasks off the trampoline,
because they cannot schedule while on it (kernel preemption is not
configured). That means it is safe to free a dynamically allocated
ftrace ops trampoline when CONFIG_PREEMPT is not configured.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-Convert printk( to pr_foo()
-Add pr_fmt
-Coalesce formats
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Currently if an active CPU fails to respond to a roundup request the CPU
that requested the roundup will become stuck. This needlessly reduces the
robustness of the debugger.
This patch introduces a timeout allowing the system state to be examined
even when the system contains unresponsive processors. It also modifies
kdb's cpu command to make it censor attempts to switch to unresponsive
processors and to report their state as (D)ead.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Currently kiosk mode must be explicitly requested by the bootloader or
userspace. It is convenient to be able to change the default value in a
similar manner to CONFIG_MAGIC_SYSRQ_DEFAULT_MASK.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Currently all kdb commands are enabled whenever kdb is deployed. This
makes it difficult to deploy kdb to help debug certain types of
systems.
Android phones provide one example; the FIQ debugger found on some
Android devices has a deliberately weak set of commands to allow the
debugger to be enabled very late in the production cycle.
Certain kiosk environments offer another interesting case where an
engineer might wish to probe the system state using passive inspection
commands without providing sufficient power for a passer-by to root it.
Without any restrictions, obtaining the root rights via KDB is a matter of
a few commands, and works everywhere. For example, log in as a normal
user:
cbou:~$ id
uid=1001(cbou) gid=1001(cbou) groups=1001(cbou)
Now enter KDB (for example via sysrq):
Entering kdb (current=0xffff8800065bc740, pid 920) due to Keyboard Entry
kdb> ps
23 sleeping system daemon (state M) processes suppressed,
use 'ps A' to see all.
Task Addr Pid Parent [*] cpu State Thread Command
0xffff8800065bc740 920 919 1 0 R 0xffff8800065bca20 *bash
0xffff880007078000 1 0 0 0 S 0xffff8800070782e0 init
[...snip...]
0xffff8800065be3c0 918 1 0 0 S 0xffff8800065be6a0 getty
0xffff8800065b9c80 919 1 0 0 S 0xffff8800065b9f60 login
0xffff8800065bc740 920 919 1 0 R 0xffff8800065bca20 *bash
All we need is the offset of cred pointers. We can look up the offset in
the distro's kernel source, but it is unnecessary. We can just start
dumping init's task_struct, until we see the process name:
kdb> md 0xffff880007078000
0xffff880007078000 0000000000000001 ffff88000703c000 ................
0xffff880007078010 0040210000000002 0000000000000000 .....!@.........
[...snip...]
0xffff8800070782b0 ffff8800073e0580 ffff8800073e0580 ..>.......>.....
0xffff8800070782c0 0000000074696e69 0000000000000000 init............
^ Here, 'init'. Creds are just above it, so the offset is 0x02b0.
Now we set up init's creds for our non-privileged shell:
kdb> mm 0xffff8800065bc740+0x02b0 0xffff8800073e0580
0xffff8800065bc9f0 = 0xffff8800073e0580
kdb> mm 0xffff8800065bc740+0x02b8 0xffff8800073e0580
0xffff8800065bc9f8 = 0xffff8800073e0580
And thus we gain root:
kdb> go
cbou:~$ id
uid=0(root) gid=0(root) groups=0(root)
cbou:~$ bash
root:~#
p.s. No distro enables kdb by default (although, with a nice KDB-over-KMS
feature availability, I would expect at least some would enable it), so
it's not actually some kind of a major issue.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
This patch introduces several new flags to collect kdb commands into
groups (later allowing them to be optionally disabled).
This follows similar prior art to enable/disable magic sysrq
commands.
The commands have been categorized as follows:
Always on: go (w/o args), env, set, help, ?, cpu (w/o args), sr,
dmesg, disable_nmi, defcmd, summary, grephelp
Mem read: md, mdr, mdp, mds, ef, bt (with args), per_cpu
Mem write: mm
Reg read: rd
Reg write: go (with args), rm
Inspect: bt (w/o args), btp, bta, btc, btt, ps, pid, lsmod
Flow ctrl: bp, bl, bph, bc, be, bd, ss
Signal: kill
Reboot: reboot
All: cpu, kgdb, (and all of the above), nmi_console
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Since we now treat KDB_REPEAT_* as flags, there is no need to
pass KDB_REPEAT_NONE. It's just the default behaviour when no
flags are specified.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The actual values of KDB_REPEAT_* enum values and overall logic stayed
the same, but we now treat the values as flags.
This makes it possible to add other flags and combine them, plus makes
the code a lot simpler and shorter. But functionality-wise, there should
be no changes.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
We're about to add more options for command behaviour, so let's give
a more generic name to the low-level kdb command registration function.
There are just various renames, no functional changes.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
We're about to add more options for command behaviour, so let's expand
the meaning of kdb_repeat_t.
So far we just do various renames; there should be no functional changes.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The struct member is never used in the code, so we can remove it.
We will introduce real flags soon by renaming cmd_repeat to cmd_flags.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
commit 63662139e5 attempted to patch a
leak (which would only happen on OOM, ie. never), but it didn't quite
work.
This rewrites the code to be as simple as possible. add_sysfs_param()
adds a parameter. If it fails, it's the caller's responsibility to
clean up the parameters which already exist.
The kzalloc-then-always-krealloc pattern is perhaps overly simplistic,
but this code has clearly confused people. It worked on me...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Remove stop_machine from module unloading by adding a new reference
counting algorithm.
This atomic refcounter works like a semaphore: it can be taken
(incremented) only when the counter is not 0. When loading a module,
the kmodule subsystem sets the counter to MODULE_REF_BASE (= 1). When
unloading the module, it subtracts MODULE_REF_BASE from the counter.
If no one refers to the module, the refcounter becomes 0 and we can
remove the module safely. If someone refers to it, we try to recover
the counter by adding MODULE_REF_BASE unless the counter has become 0,
because the referrer can put the module right before the recovery.
If the recovery fails, the refcount is 0 and can never be incremented
again, so the module can be removed safely too.
Note that __module_get() forcibly gets the module refcounter,
users should use try_module_get() instead of that.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Replace module_ref per-cpu complex reference counter with
an atomic_t simple refcnt. This is for code simplification.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Actually since module_bug_list should be used in BUG context,
we may not need this. But for someone who wants to use this
from normal context, this makes module_bug_list an RCU list.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Unlink module from module list with RCU synchronizing instead
of using stop_machine(). Since module list is already protected
by rcu, we don't need stop_machine() anymore.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Wait for RCU synchronizing on failure path of module loading
before releasing struct module, because the memory of mod->list
can still be accessed by list walkers (e.g. kallsyms).
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If the read loop in trace_buffers_splice_read() keeps failing due to
memory allocation failures without reading even a single page then this
function will keep busy looping.
Remove the risk for that by exiting the function if memory allocation
failures are seen.
Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
lockup:
# trace-cmd record -e raw_syscalls:* -F false
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
...
Call Trace:
[<ffffffff8105b580>] ? __wake_up_common+0x90/0x90
[<ffffffff81092e25>] wait_on_pipe+0x35/0x40
[<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0
[<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0
[<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290
[<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0
[<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
[<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60
[<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
[<ffffffff8112415d>] do_splice_to+0x6d/0x90
[<ffffffff81126971>] SyS_splice+0x7c1/0x800
[<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8
The problem is this: tracing_buffers_splice_read() calls
ring_buffer_wait() to wait for data in the ring buffers. The buffers
are not empty so ring_buffer_wait() returns immediately. But
tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
meaning it only wants to read a full page. When the full page is not
available, tracing_buffers_splice_read() tries to wait again with
ring_buffer_wait(), which again returns immediately, and so on.
Fix this by adding a "full" argument to ring_buffer_wait() which will
make ring_buffer_wait() wait until the writer has left the reader's
page, i.e. until full-page reads will succeed.
Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in
Cc: stable@vger.kernel.org # 3.16+
Fixes: b1169cc69b ("tracing: Remove mock up poll wait function")
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On the latest mm + KASan patchset I got this:
==================================================================
BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
=============================================================================
BUG kmalloc-8 (Not tainted): kasan error
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
__slab_alloc+0x4b4/0x4f0
__kmalloc_track_caller+0x15f/0x1e0
kstrdup+0x44/0x90
alloc_vfsmnt+0xb0/0x2c0
vfs_kern_mount+0x35/0x190
kern_mount_data+0x25/0x50
pid_ns_prepare_proc+0x19/0x50
alloc_pid+0x5e2/0x630
copy_process.part.41+0xdf5/0x2aa0
do_fork+0xf5/0x460
kernel_thread+0x21/0x30
rest_init+0x1e/0x90
start_kernel+0x522/0x531
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x15b/0x16a
INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70
Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk.
Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........
Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
Call Trace:
dump_stack (lib/dump_stack.c:52)
print_trailer (mm/slub.c:645)
object_err (mm/slub.c:652)
? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
? kasan_poison_shadow (mm/kasan/kasan.c:48)
? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
? kasan_poison_shadow (mm/kasan/kasan.c:48)
? kasan_kmalloc (mm/kasan/kasan.c:311)
__asan_load4 (mm/kasan/kasan.c:371)
? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
kernel_init_freeable (init/main.c:869 init/main.c:997)
? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
? rest_init (init/main.c:924)
kernel_init (init/main.c:929)
? rest_init (init/main.c:924)
ret_from_fork (arch/x86/kernel/entry_64.S:348)
? rest_init (init/main.c:924)
Read of size 4 by task swapper/0:
Memory state around the buggy address:
ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
^
ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
access in this line:
sched_max_numa_distance = sched_domains_numa_distance[level - 1];
Fix this by exiting from sched_init_numa() earlier.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Fixes: 9942f79ba ("sched/numa: Export info needed for NUMA balancing on complex topologies")
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use io{read,write}32be if the caller specified IRQ_GC_BE_IO when creating
the irqchip.
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/1415342669-30640-5-git-send-email-cernekee@gmail.com
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
Pass in the irq_chip_generic struct so we can use different readl/writel
settings for each irqchip driver, when appropriate. Compute
(gc->reg_base + reg_offset) in the helper function because this is pretty
much what all callers want to do anyway.
Compile-tested using the following configurations:
at91_dt_defconfig (CONFIG_ATMEL_AIC_IRQ=y)
sama5_defconfig (CONFIG_ATMEL_AIC5_IRQ=y)
sunxi_defconfig (CONFIG_ARCH_SUNXI=y)
tb10x (ARC) is untested.
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/1415342669-30640-3-git-send-email-cernekee@gmail.com
Signed-off-by: Jason Cooper <jason@lakedaemon.net>
If no freeze_ops is set, trying to enter suspend-to-IDLE will cause a
nice oops in platform_suspend_prepare_late(). Add respective checks to
platform_suspend_prepare_late() and platform_resume_early() functions.
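The kind of check being added can be sketched in plain C as follows. This is
an illustration only: struct fake_sleep_ops, registered_ops and the demo hook
are hypothetical names, and the real functions operate on the kernel's own
ops structure:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down ops table; only the shape matters here. */
struct fake_sleep_ops {
	int (*prepare)(void);
	void (*restore)(void);
};

/* May legitimately be NULL when no platform registered its callbacks. */
struct fake_sleep_ops *registered_ops;

/* Demo hook that just counts how often it was invoked. */
int demo_prepare_calls;
int demo_prepare(void)
{
	demo_prepare_calls++;
	return 0;
}

/* Call the optional hook only if both the table and the hook exist. */
int fake_suspend_prepare_late(void)
{
	if (registered_ops && registered_ops->prepare)
		return registered_ops->prepare();
	return 0;	/* nothing to call is not an error */
}
```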
Fixes: a8d46b9e4e (ACPI / sleep: Rework the handling of ACPI GPE wakeup ...)
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
tiocspgrp() is the lone caller of session_of_pgrp(); relocate and
limit to file scope.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Reviewed-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the kernel.dmesg_restrict restriction is in place, only users with
CAP_SYSLOG should be able to access crash dumps (for example: an attacker
tries to exploit a bug, the watchdog reboots the machine, and the attacker
can then happily read crash dumps and logs).
This puts the restriction on console-* types as well, as sensitive
information could have been leaked there.
Other log types are unaffected.
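The resulting policy can be summarised by a small predicate. The sketch below
is a userspace model with hypothetical names (enum fake_record_type,
may_read_record()); the two booleans stand in for the kernel.dmesg_restrict
sysctl and for the caller holding CAP_SYSLOG:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record types mirroring the ones named in the text. */
enum fake_record_type { REC_DMESG, REC_CONSOLE, REC_OTHER };

/*
 * Decide whether a reader may open a record.
 * restrict_on: stands in for the kernel.dmesg_restrict sysctl
 * has_syslog:  stands in for "caller holds CAP_SYSLOG"
 */
bool may_read_record(enum fake_record_type type, bool restrict_on,
		     bool has_syslog)
{
	switch (type) {
	case REC_DMESG:
	case REC_CONSOLE:
		/* Crash dumps and console logs follow the restriction. */
		return !restrict_on || has_syslog;
	default:
		/* Other log types are unaffected. */
		return true;
	}
}
```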
Signed-off-by: Sebastian Schmidt <yath@yath.de>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch simplifies task_struct by removing the four numa_* pointers
that point into the same array and replacing them with a single array
pointer. On x86_64, this reduces the size of task_struct by 3 ulong
pointers (24 bytes).
A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.
All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.
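To make the indexing idea concrete, here is a hedged sketch of one flat array
addressed by (statistic, node, private) instead of four precomputed pointers.
The enum values, FAKE_FAULT_TYPES and faults_idx() are illustrative
assumptions, as is the layout itself:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector for which of the four former regions is wanted. */
enum fake_faults_stats {
	FAULTS_MEM = 0,
	FAULTS_CPU,
	FAULTS_MEMBUF,
	FAULTS_CPUBUF
};

#define FAKE_FAULT_TYPES 2	/* shared (0) and private (1) */

/*
 * Index into one flat array laid out as [stat][node][private],
 * so a single base pointer is enough.
 */
size_t faults_idx(enum fake_faults_stats s, size_t nid, int priv,
		  size_t nr_nodes)
{
	assert(nid < nr_nodes && (priv == 0 || priv == 1));
	return FAKE_FAULT_TYPES * ((size_t)s * nr_nodes + nid) + (size_t)priv;
}
```

With nr_nodes = 4 the array needs 4 * 4 * 2 = 32 slots, and every former
pointer corresponds to a fixed value of the first argument.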
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are both UP and SMP versions of pull_dl_task(), so there is no need
to check CONFIG_SMP in switched_from_dl().
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In switched_from_dl() we have to issue a resched if we successfully
pulled some task from other cpus. This patch also aligns the behavior
with -rt.
Suggested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch pushes the task away if the deadline of the task is equal
to that of current during wake up. This is the same behavior as the rt class.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The yield semantic of the deadline class is to reduce the remaining runtime
to zero, and then update_curr_dl() will stop it. However, consumed bandwidth
is subtracted from the budget of the yielding task again even if it has
already been set to zero, which leads to an artificial overrun. This patch
fixes it by making sure we don't steal some more time from the task that
yielded in update_curr_dl().
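The accounting rule can be modelled in a few lines. This is a sketch under
assumed names (struct fake_dl_entity, fake_yield(), fake_update_curr()), not
the scheduler code; it only shows that execution time is no longer charged
once the task has yielded:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down deadline entity: remaining budget plus a flag. */
struct fake_dl_entity {
	int64_t runtime;	/* remaining budget, in ns */
	int yielded;		/* set once the task has yielded */
};

/* Yield: give up whatever budget is left for this period. */
void fake_yield(struct fake_dl_entity *se)
{
	se->yielded = 1;
	se->runtime = 0;
}

/* Charge delta_exec ns of execution, unless the task already yielded. */
void fake_update_curr(struct fake_dl_entity *se, int64_t delta_exec)
{
	assert(delta_exec >= 0);
	if (!se->yielded)
		se->runtime -= delta_exec;
}
```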
Suggested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch checks in advance whether current can be pushed/pulled somewhere
else, to make the logic clear; this is the same behavior as the dl class.
- If current can't be migrated, it is useless to reschedule; let's hope
the task can move out.
- If the task is migratable, let's not schedule it and see if it
can be pushed or pulled somewhere else.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As per commit f10e00f4bf ("sched/dl: Use dl_bw_of() under
rcu_read_lock_sched()"), dl_bw_of() has to be protected by
rcu_read_lock_sched().
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An idle cpu is idler than a non-idle cpu, so we needn't search for
least_loaded_cpu after we have found an idle cpu.
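A simplified model of that search, assuming a hypothetical struct
fake_cpu_state and helper pick_cpu(): load comparisons are skipped as soon as
any idle cpu is known, and the least loaded cpu is only used as a fallback:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cpu snapshot used only by this sketch. */
struct fake_cpu_state {
	int idle;		/* non-zero when the cpu is idle */
	unsigned long load;	/* only meaningful for busy cpus */
};

/*
 * Return the index of the first idle cpu, else the least loaded one,
 * else -1 when n == 0.
 */
int pick_cpu(const struct fake_cpu_state *cpus, size_t n)
{
	int idle_cpu = -1, least_loaded = -1;
	unsigned long min_load = 0;

	assert(n == 0 || cpus != NULL);
	for (size_t i = 0; i < n; i++) {
		if (cpus[i].idle) {
			if (idle_cpu == -1)
				idle_cpu = (int)i;
		} else if (idle_cpu == -1 &&
			   (least_loaded == -1 || cpus[i].load < min_load)) {
			/* Only track load while no idle cpu has been seen. */
			min_load = cpus[i].load;
			least_loaded = (int)i;
		}
	}
	return idle_cpu != -1 ? idle_cpu : least_loaded;
}
```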
Signed-off-by: Yao Dongdong <yaodongdong@huawei.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently used hrtimer_try_to_cancel() is racy:
raw_spin_lock(&rq->lock)
... dl_task_timer raw_spin_lock(&rq->lock)
... raw_spin_lock(&rq->lock) ...
switched_from_dl() ... ...
hrtimer_try_to_cancel() ... ...
switched_to_fair() ... ...
... ... ...
... ... ...
raw_spin_unlock(&rq->lock) ... (acquired)
... ... ...
... ... ...
do_exit() ... ...
schedule() ... ...
raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock)
... ... ...
raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock)
... ... (acquired)
put_task_struct() ... ...
free_task_struct() ... ...
... ... raw_spin_unlock(&rq->lock)
... (acquired) ...
... ... ...
... (use after free) ...
So, let's implement a 100% guaranteed way to cancel the timer and let's
be sure we are safe even in very unlikely situations.
rq unlocking does not limit the area of switched_from_dl() use, because
this has already been possible in pull_dl_task() below.
Let's consider the safety of this unlocking. The new code in the patch
runs when hrtimer_try_to_cancel() fails. This means the callback
is running. In this case hrtimer_cancel() is just waiting till the
callback is finished. Two cases are possible:
1) Since we are in switched_from_dl(), the new class is not dl_sched_class and
the new prio is not less than MAX_DL_PRIO. So, the callback returns early; it's
right after the !dl_task() check. After that hrtimer_cancel() returns too.
The above is:
raw_spin_lock(rq->lock); ...
... dl_task_timer()
... raw_spin_lock(rq->lock);
switched_from_dl() ...
hrtimer_try_to_cancel() ...
raw_spin_unlock(rq->lock); ...
hrtimer_cancel() ...
... raw_spin_unlock(rq->lock);
... return HRTIMER_NORESTART;
... ...
raw_spin_lock(rq->lock); ...
2) But the below is also possible:
dl_task_timer()
raw_spin_lock(rq->lock);
...
raw_spin_unlock(rq->lock);
raw_spin_lock(rq->lock); ...
switched_from_dl() ...
hrtimer_try_to_cancel() ...
... return HRTIMER_NORESTART;
raw_spin_unlock(rq->lock); ...
hrtimer_cancel(); ...
raw_spin_lock(rq->lock); ...
In this case hrtimer_cancel() returns immediately. This is a very unlikely
case, mentioned only for completeness.
Nobody can manipulate the task, because check_class_changed() is
always called with pi_lock locked. Nobody can force the task to
participate in (concurrent) priority inheritance schemes (the same reason).
All concurrent task operations require pi_lock, which is held by us.
No deadlocks with dl_task_timer() are possible, because it returns
right after !dl_task() check (it does nothing).
If we receive a new dl_task while the rq is unlocked, we just
don't have to do pull_dl_task() in switched_from_dl() any further.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
[ Added comments]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In some cases this can trigger a true flood of output.
Requested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kauditd_thread wait loop is a bit iffy; it has a number of problems:
- calls try_to_freeze() before schedule(); you typically want the
thread to re-evaluate the sleep condition when unfreezing, also
freeze_task() issues a wakeup.
- it unconditionally does the {add,remove}_wait_queue(), even when the
sleep condition is false.
Use wait_event_freezable() that does the right thing.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141002102251.GA6324@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is a race between kthread_stop() and the new wait_woken() that
can result in a lack of progress.
CPU 0 | CPU 1
|
rfcomm_run() | kthread_stop()
... |
if (!test_bit(KTHREAD_SHOULD_STOP)) |
| set_bit(KTHREAD_SHOULD_STOP)
| wake_up_process()
wait_woken() | wait_for_completion()
set_current_state(INTERRUPTIBLE) |
if (!WQ_FLAG_WOKEN) |
schedule_timeout() |
|
After which both tasks will wait.. forever.
Fix this by having wait_woken() check for kthread_should_stop() but
only for kthreads (obviously).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sched_move_task() is the only interface to change sched_task_group:
cpu_cgrp_subsys methods and autogroup_move_group() use it.
Everything is synchronized by task_rq_lock(), so cpu_cgroup_attach()
is ordered with other users of sched_move_task(). This means we do not
need RCU here: if we've dereferenced a tg here, the .attach method
hasn't been called for it yet.
Thus, we should pass "true" to task_css_check() to silence lockdep
warnings.
Fixes: eeb61e53ea ("sched: Fix race between task_group and sched_task_group")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414473874.8574.2.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 38706bc5a2 (rcutorture: Add callback-flood test) vmalloc()ed
a bunch of RCU callbacks, but failed to free them. This commit fixes
that oversight.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Add early boot self tests for RCU under CONFIG_PROVE_RCU.
Currently the only test is adding a dummy callback which increments a counter
which we then later verify after calling rcu_barrier*().
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A long string of get_online_cpus() with each followed by a
put_online_cpus() that fails to acquire cpu_hotplug.lock can result in
overflow of the cpu_hotplug.puts_pending counter. Although this is
perhaps improbable, a system with absolutely no CPU-hotplug operations
will have an arbitrarily long time in which this overflow could occur.
This commit therefore adds overflow checks to get_online_cpus() and
try_get_online_cpus().
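One way to picture such a check is a counter of deferred decrements that gets
folded back into the reference count once it passes a threshold. The sketch
below is single-threaded and uses made-up names (struct fake_hotplug,
FAKE_PENDING_MAX and the fake_*() helpers); it is an assumption about the
general shape, not a copy of the actual fix:

```c
#include <assert.h>
#include <limits.h>

#define FAKE_PENDING_MAX 4	/* hypothetical threshold, tiny for the demo */

/* Hypothetical stand-in for a refcount with deferred decrements. */
struct fake_hotplug {
	int refcount;		/* readers currently accounted for */
	int puts_pending;	/* decrements that could not be applied yet */
};

/* Fold the deferred decrements into refcount once enough have piled up. */
void fake_apply_puts_pending(struct fake_hotplug *h, int max)
{
	if (h->puts_pending >= max) {
		h->refcount -= h->puts_pending;
		h->puts_pending = 0;
	}
}

/* Take a reference, first making sure puts_pending cannot grow unbounded. */
void fake_get(struct fake_hotplug *h)
{
	fake_apply_puts_pending(h, FAKE_PENDING_MAX);
	h->refcount++;
}

/* Drop a reference on the slow path: only record the decrement. */
void fake_put_deferred(struct fake_hotplug *h)
{
	assert(h->puts_pending < INT_MAX);
	h->puts_pending++;
}
```

Because every get first drains a large enough backlog, the pending counter
stays bounded no matter how long the get/put sequence runs.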
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument to rcu_cleanup_after_idle() is always the current
CPU, so drop it. This moves the smp_processor_id() from the caller to
rcu_cleanup_after_idle(), saving argument-passing overhead. Again,
the anticipated cross-CPU uses of these functions have been replaced
by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument to rcu_prepare_for_idle() is always the current
CPU, so drop it. This in turn allows two of the uses of "cpu" in
this function to be replaced with a this_cpu_ptr() and the third by
smp_processor_id(), replacing that of the call to rcu_prepare_for_idle().
Again, the anticipated cross-CPU uses of these functions have been replaced
by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop
it. This in turn allows the "cpu" argument to rcu_cpu_has_callbacks()
to be removed, which allows the uses of "cpu" in both functions to be
replaced with a this_cpu_ptr(). Again, the anticipated cross-CPU uses
of these functions have been replaced by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument to rcu_note_context_switch() is always the current
CPU, so drop it. This in turn allows the "cpu" argument to
rcu_preempt_note_context_switch() to be removed, which allows the sole
use of "cpu" in both functions to be replaced with a this_cpu_ptr().
Again, the anticipated cross-CPU uses of these functions have been
replaced by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Because rcu_preempt_check_callbacks()'s argument is guaranteed to
always be the current CPU, drop the argument and replace per_cpu()
with __this_cpu_read().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Because rcu_pending()'s argument is guaranteed to always be the current
CPU, drop the argument and replace per_cpu_ptr() with this_cpu_ptr().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument was kept around on the off-chance that RCU might
offload scheduler-clock interrupts. However, this offload approach
has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
from qualifying CPUs. It is therefore time to remove the "cpu" argument
to rcu_check_callbacks(), which this commit does.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU. This situation can result in false
sharing with per-CPU variables that have the misfortune of being allocated
adjacent to rcu_data in memory. This commit therefore changes the
DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid
this false sharing.
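The effect of aligning such a variable can be shown with C11 alignas in
userspace. The 64-byte line size and both struct names are assumptions for
this sketch; real cache line sizes vary by CPU:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define FAKE_CACHE_LINE 64	/* assumed line size; real hardware varies */

/* Data updated by other CPUs; aligned so it owns its cache line(s). */
struct hot_data {
	alignas(FAKE_CACHE_LINE) long value;
};

/* Neighbouring data that should not share a line with hot_data. */
struct percpu_block {
	struct hot_data shared;	/* starts on a line boundary */
	long local_only;	/* pushed onto a later line by the padding */
};

/* The alignment also pads the struct out to a whole number of lines. */
static_assert(alignof(struct hot_data) == FAKE_CACHE_LINE,
	      "hot_data must be cache line aligned");
static_assert(sizeof(struct hot_data) % FAKE_CACHE_LINE == 0,
	      "hot_data must fill whole cache lines");
```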
Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(rdtp). Remove the parameter if constant and calculate the
pointer in the function.
This has the advantage of making it obvious that the addresses are
all per-cpu offsets, and thus it will enable the use of this_cpu_ops in
the future.
Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
This patch migrates swsusp_show_speed and its callers to using ktime_t instead
of 'struct timeval' which suffers from the y2038 problem.
Changes to swsusp_show_speed:
- use ktime_t for start and stop times
- pass start and stop times by value
Calling functions affected:
- load_image
- load_image_lzo
- save_image
- save_image_lzo
- hibernate_preallocate_memory
Design decisions:
- use ktime_t to preserve same granularity of reporting as before
- use centisecs logic as before to avoid 'div by zero' issues caused by
using seconds and nanoseconds directly
- use monotonic time (ktime_get()) since we only care about elapsed time.
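The centisecond logic from the design decisions above can be sketched as
follows. The function name kbytes_per_sec() and the use of plain nanosecond
counts instead of ktime_t are assumptions made so the example runs in
userspace:

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_NSEC_PER_CENTISEC 10000000LL	/* 10 ms in nanoseconds */

/*
 * Throughput in kB/s for 'kbytes' moved between two monotonic
 * timestamps given in nanoseconds. Elapsed time is truncated to
 * centiseconds and clamped to 1 so the division cannot trap.
 */
int64_t kbytes_per_sec(int64_t start_ns, int64_t stop_ns, int64_t kbytes)
{
	int64_t centisecs;

	assert(stop_ns >= start_ns);
	centisecs = (stop_ns - start_ns) / FAKE_NSEC_PER_CENTISEC;
	if (centisecs == 0)
		centisecs = 1;	/* avoid div-by-zero for very short runs */

	return (kbytes * 100) / centisecs;
}
```

In a userspace test the two timestamps could come from
clock_gettime(CLOCK_MONOTONIC, ...), which matches the "elapsed time only"
reasoning for using a monotonic clock.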
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are fixes received after my previous pull request plus one that
has been in the works for quite a while, but its previous version
caused problems to happen, so it's been deferred till now.
Fixed are two recent regressions (MFD enumeration and cpufreq-dt),
ACPI EC regression introduced in 3.17, system suspend error code path
regression introduced in 3.15, an older bug related to recovery from
failing resume from hibernation and a cpufreq-dt driver issue related
to operation performance points.
Specifics:
- Fix a crash on r8a7791/koelsch during resume from system suspend
caused by a recent cpufreq-dt commit (Geert Uytterhoeven).
- Fix an MFD enumeration problem introduced by a recent commit adding
ACPI support to the MFD subsystem that exposed a weakness in the
ACPI core causing ACPI enumeration to be applied to all devices
associated with one ACPI companion object, although it should be
used for one of them only (Mika Westerberg).
- Fix an ACPI EC regression introduced during the 3.17 cycle causing
some Samsung laptops to misbehave as a result of a workaround
targeted at some Acer machines. That includes a revert of a commit
that went too far and a quirk for the Acer machines in question.
From Lv Zheng.
- Fix a regression in the system suspend error code path introduced
during the 3.15 cycle that causes it to fail to take errors from
asynchronous execution of "late" suspend callbacks into account
(Imre Deak).
- Fix a long-standing bug in the hibernation resume error code path
that fails to roll back everything correctly on "freeze" callback
errors and leaves some devices in a "suspended" state causing more
breakage to happen subsequently (Imre Deak).
- Make the cpufreq-dt driver disable operation performance points
that are not supported by the VR connected to the CPU voltage plane
with acceptable tolerance instead of constantly failing voltage
scaling later on (Lucas Stach)"
* tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / EC: Fix regression due to conflicting firmware behavior between Samsung and Acer.
Revert "ACPI / EC: Add support to disallow QR_EC to be issued before completing previous QR_EC"
cpufreq: cpufreq-dt: Restore default cpumask_setall(policy->cpus)
PM / Sleep: fix recovery during resuming from hibernation
PM / Sleep: fix async suspend_late/freeze_late error handling
ACPI: Use ACPI companion to match only the first physical device
cpufreq: cpufreq-dt: disable unsupported OPPs
Pull networking fixes from David Miller:
"A bit has accumulated, but it's been a week or so since my last batch
of post-merge-window fixes, so...
1) Missing module license in netfilter reject module, from Pablo.
Lots of people ran into this.
2) Off by one in mac80211 baserate calculation, from Karl Beldan.
3) Fix incorrect return value from ax88179_178a driver's set_mac_addr
op, which broke use of it with bonding. From Ian Morgan.
4) Checking of skb_gso_segment()'s return value was not all
encompassing, it can return an SKB pointer, a pointer error, or
NULL. Fix from Florian Westphal.
This is crummy, and longer term will be fixed to just return error
pointers or a real SKB.
6) Encapsulation offloads not being handled by
skb_gso_transport_seglen(). From Florian Westphal.
7) Fix deadlock in TIPC stack, from Ying Xue.
8) Fix performance regression from using rhashtable for netlink
sockets. The problem was the synchronize_net() invoked for every
socket destroy. From Thomas Graf.
9) Fix bug in eBPF verifier, and remove the strong dependency of BPF
on NET. From Alexei Starovoitov.
10) In qdisc_create(), use the correct interface to allocate
->cpu_bstats, otherwise the u64_stats_sync member isn't
initialized properly. From Sabrina Dubroca.
11) Off by one in ip_set_nfnl_get_byindex(), from Dan Carpenter.
12) nf_tables_newchain() was erroneously expecting error pointers from
netdev_alloc_pcpu_stats(). It only returns a valid pointer or
NULL. From Sabrina Dubroca.
13) Fix use-after-free in _decode_session6(), from Li RongQing.
14) When we set the TX flow hash on a socket, we mistakenly do so
before we've nailed down the final source port. Move the setting
deeper to fix this. From Sathya Perla.
15) NAPI budget accounting in amd-xgbe driver was counting descriptors
instead of full packets, fix from Thomas Lendacky.
16) Fix total_data_buflen calculation in hyperv driver, from Haiyang
Zhang.
17) Fix bcma driver build with OF_ADDRESS disabled, from Hauke
Mehrtens.
18) Fix mis-use of per-cpu memory in TCP md5 code. The problem is
that something that ends up being vmalloc memory can't be passed
to the crypto hash routines via scatter-gather lists. From Eric
Dumazet.
19) Fix regression in promiscuous mode enabling in cdc-ether, from
Olivier Blin.
20) Bucket eviction and frag entry killing can race with each other,
causing an unlink of the object from the wrong list. Fix from
Nikolay Aleksandrov.
21) Missing initialization of spinlock in cxgb4 driver, from Anish
Bhatt.
22) Do not cache ipv4 routing failures, otherwise if the sysctl for
forwarding is subsequently enabled this won't be seen. From
Nicolas Cavallari"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (131 commits)
drivers: net: cpsw: Support ALLMULTI and fix IFF_PROMISC in switch mode
drivers: net: cpsw: Fix broken loop condition in switch mode
net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
stmmac: pci: set default of the filter bins
net: smc91x: Fix gpios for device tree based booting
mpls: Allow mpls_gso to be built as module
mpls: Fix mpls_gso handler.
r8152: stop submitting intr for -EPROTO
netfilter: nft_reject_bridge: restrict reject to prerouting and input
netfilter: nft_reject_bridge: don't use IP stack to reject traffic
netfilter: nf_reject_ipv6: split nf_send_reset6() in smaller functions
netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions
netfilter: nf_tables_bridge: update hook_mask to allow {pre,post}routing
drivers/net: macvtap and tun depend on INET
drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
drivers/net: Disable UFO through virtio
net: skb_fclone_busy() needs to detect orphaned skb
gre: Use inner mac length when computing tunnel length
mlx4: Avoid leaking steering rules on flow creation error flow
net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
...
Pull scheduler fixes from Ingo Molnar:
"Various scheduler fixes all over the place: three SCHED_DL fixes,
three sched/numa fixes, two generic race fixes and a comment fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/dl: Fix preemption checks
sched: Update comments for CLONE_NEWNS
sched: stop the unbound recursion in preempt_schedule_context()
sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
sched/fair: Care divide error in update_task_scan_period()
sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
sched: Fix race between task_group and sched_task_group
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, plus on the kernel side:
- a revert for a newly introduced PMU driver which isn't complete yet
and where we ran out of time with fixes (to be tried again in
v3.19) - this makes up for a large chunk of the diffstat.
- compilation warning fixes
- a printk message fix
- event_idx usage fixes/cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Trivial typo fix for --demangle
perf tools: Fix report -F dso_from for data without branch info
perf tools: Fix report -F dso_to for data without branch info
perf tools: Fix report -F symbol_from for data without branch info
perf tools: Fix report -F symbol_to for data without branch info
perf tools: Fix report -F mispredict for data without branch info
perf tools: Fix report -F in_tx for data without branch info
perf tools: Fix report -F abort for data without branch info
perf tools: Make CPUINFO_PROC an array to support different kernel versions
perf callchain: Use global caching provided by libunwind
perf/x86/intel: Revert incomplete and undocumented Broadwell client support
perf/x86: Fix compile warnings for intel_uncore
perf: Fix typos in sample code in the perf_event.h header
perf: Fix and clean up initialization of pmu::event_idx
perf: Fix bogus kernel printk
perf diff: Add missing hists__init() call at tool start
Pull futex fixes from Ingo Molnar:
"This contains two futex fixes: one fixes a race condition, the other
clarifies shared/private futex comments"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Fix a race condition between REQUEUE_PI and task death
futex: Mention key referencing differences between shared and private futexes
Pull core fixes from Ingo Molnar:
"The tree contains two RCU fixes and a compiler quirk comment fix"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Make rcu_barrier() understand about missing rcuo kthreads
compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
rcu: More on deadlock between CPU hotplug and expedited grace periods
Pull timer fixes from Thomas Gleixner:
"As you requested in the rc2 release mail the timer department serves
you a few real bug fixes:
- Fix the probe logic of the architected arm/arm64 timer
- Plug a stack info leak in posix-timers
- Prevent a shift out of bounds issue in the clockevents core"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ARM/ARM64: arch-timer: fix arch_timer_probed logic
clockevents: Prevent shift out of bounds
posix-timers: Fix stack info leak in timer_create()
Merge tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"ARM has system calls outside the NR_syscalls range, and the generic
tracing system does not support that and without checks, it can cause
an oops to be reported.
Rabin Vincent added checks in the return code on syscall events to
make sure that the system call number is within the range that tracing
knows about, and if not, simply ignores the system call.
The system call tracing infrastructure needs to be rewritten to handle
these cases better, but for now, to keep from oopsing, this patch will
do"
* tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/syscalls: Ignore numbers outside NR_syscalls' range
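The range check described in this pull request boils down to validating the
number before it is used as an array index. A minimal sketch, where
FAKE_NR_SYSCALLS, hit_count[] and trace_syscall() are hypothetical stand-ins
rather than the tracing code itself:

```c
#include <assert.h>

#define FAKE_NR_SYSCALLS 4	/* hypothetical table size for the demo */

/* Per-syscall counters, indexed by syscall number. */
unsigned long hit_count[FAKE_NR_SYSCALLS];

/*
 * Record one syscall event. Numbers outside the table (negative or
 * beyond its end) are ignored. Returns 1 if counted, 0 if ignored.
 */
int trace_syscall(long nr)
{
	if (nr < 0 || nr >= FAKE_NR_SYSCALLS)
		return 0;

	hit_count[nr]++;
	return 1;
}
```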
The file /sys/kernel/debug/tracing/enabled_functions is used to debug
ftrace function hooks. Add to the output what function is being called
by the trampoline if the arch supports it.
Add support for this feature in x86_64.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current method of handling multiple function callbacks is to register
a list function callback that walks all the other callbacks, compares
each one's hash table against the function it was called on, and calls
those that match. But this is very inefficient.
For example, if you are tracing all functions in the kernel and then
add a kprobe to a function such that the kprobe uses ftrace, the
mcount trampoline will switch from calling the function trace callback
to calling the list callback that will iterate over all registered
ftrace_ops (in this case, the function tracer and the kprobes callback).
That means for every function being traced it checks the hash of the
ftrace_ops for function tracing and kprobes, even though the kprobes
is only set at a single function. The kprobes ftrace_ops is checked
for every function being traced!
Instead of calling the list function for functions that are only being
traced by a single callback, we can call a dynamically allocated
trampoline that calls the callback directly. The function graph tracer
already uses a direct call trampoline when it is being traced by itself
but it is not dynamically allocated. Its trampoline is static in the
kernel core. The infrastructure that called the function graph trampoline
can also be used to call a dynamically allocated one.
For now, only ftrace_ops that are not dynamically allocated can have
a trampoline. That is, users such as function tracer or stack tracer.
kprobes and perf allocate their ftrace_ops, and until there's a safe
way to free the trampoline, it cannot be used. The dynamically allocated
ftrace_ops may, however, use the trampoline if the kernel is not
compiled with CONFIG_PREEMPT. But that will come later.
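As a rough user-space illustration of the cost described above (this is
not the kernel implementation; every demo_* name is hypothetical), the
sketch below contrasts a list dispatcher, which runs each registered
ops' filter lookup for every traced function, with a direct call that
skips the lookup when only one ops hooks the function.

```c
#include <stddef.h>

/* Hypothetical stand-in for an ftrace_ops: a filter plus a callback. */
struct demo_ops {
	int (*matches)(unsigned long ip);	/* models the hash lookup */
	void (*func)(unsigned long ip);		/* the tracing callback */
};

static unsigned long demo_filter_checks;	/* number of filter lookups */
static unsigned long demo_hits_a, demo_hits_b;

/* Filter that traces every function (models the function tracer). */
static int demo_match_all(unsigned long ip)
{
	(void)ip;
	demo_filter_checks++;
	return 1;
}

/* Filter that traces a single function (models one kprobe). */
static int demo_match_one(unsigned long ip)
{
	demo_filter_checks++;
	return ip == 42;
}

static void demo_cb_a(unsigned long ip) { (void)ip; demo_hits_a++; }
static void demo_cb_b(unsigned long ip) { (void)ip; demo_hits_b++; }

/* List dispatch: every ops is checked for every traced function. */
static void demo_list_func(const struct demo_ops *ops, size_t n,
			   unsigned long ip)
{
	for (size_t i = 0; i < n; i++)
		if (ops[i].matches(ip))
			ops[i].func(ip);
}

/* Direct dispatch: one ops hooks this function, so no lookup is needed. */
static void demo_direct_call(const struct demo_ops *op, unsigned long ip)
{
	op->func(ip);
}
```

In this model, tracing 100 functions through the list costs 200 filter
lookups, while using the direct call everywhere except the one function
shared by both ops costs only 2.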
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.
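A minimal user-space sketch of the two-sided range check described
above. DEMO_NR_SYSCALLS and the demo_* helpers are hypothetical
stand-ins, not the kernel's handlers; the point is only that a syscall
number must be rejected before it is used as an array index.

```c
#include <stdbool.h>

#define DEMO_NR_SYSCALLS 400	/* hypothetical table size, not a real arch value */

/* Per-syscall counters, indexed by syscall number. */
static unsigned long demo_enter_count[DEMO_NR_SYSCALLS];

/* True when nr can safely index a DEMO_NR_SYSCALLS-sized table. */
static bool demo_syscall_nr_valid(long nr)
{
	return nr >= 0 && nr < DEMO_NR_SYSCALLS;
}

/*
 * Models a sys_enter handler: out-of-range numbers are ignored instead
 * of being used as an index. Returns true if the event was recorded.
 */
static bool demo_trace_sys_enter(long nr)
{
	if (!demo_syscall_nr_valid(nr))
		return false;
	demo_enter_count[nr]++;
	return true;
}
```

With this check, a number such as 983045 from the trace above is
silently skipped, while 192 is recorded as usual.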
Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a space between subj= and feature= fields to make them parsable.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
The verifier keeps track of register state spilled to the stack.
Registers are 8 bytes wide and always aligned, so instead of tracking them
in every byte-sized stack slot, use a MAX_BPF_STACK / 8 array to track
spilled register state.
Though the verifier runs in user context and its state is freed immediately
after verification, it makes sense to reduce its memory usage.
This optimization reduces sizeof(struct verifier_state)
from 12464 to 1712 on 64-bit and from 6232 to 1112 on 32-bit.
Note, this patch doesn't change existing limits, which are there to bound
time and memory during verification: 4k total number of insns in a program,
1k number of jumps (states to visit) and 32k number of processed insn
(since an insn may be visited multiple times). Theoretical worst case memory
during verification is 1712 * 1k = ~1.7Mbyte. An out-of-memory situation
triggers cleanup and rejects the program.
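To illustrate the layout change, here is a simplified sketch. The
demo_reg_state struct is hypothetical and much smaller than the real
register state, and DEMO_MAX_BPF_STACK = 512 is assumed to match the
kernel's MAX_BPF_STACK; only the array sizing and the offset-to-slot
mapping are the point.

```c
#include <stddef.h>

#define DEMO_MAX_BPF_STACK 512	/* assumed to match the kernel's MAX_BPF_STACK */
#define DEMO_REG_SIZE 8		/* registers are 8 bytes wide */

/* Hypothetical, simplified register state; the real struct differs. */
struct demo_reg_state {
	int type;
	long imm;
};

/* Before: one spilled-register record per stack byte. */
struct demo_state_per_byte {
	struct demo_reg_state spilled[DEMO_MAX_BPF_STACK];
};

/* After: one record per aligned 8-byte stack slot. */
struct demo_state_per_slot {
	struct demo_reg_state spilled[DEMO_MAX_BPF_STACK / DEMO_REG_SIZE];
};

/* Map a byte offset within the stack to its 8-byte slot index. */
static size_t demo_slot_index(size_t byte_off)
{
	return byte_off / DEMO_REG_SIZE;
}
```

The spilled-register array shrinks by a factor of eight in this model;
the overall saving reported above is a little smaller because the
verifier state holds other fields as well.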
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull two RCU fixes from Paul E. McKenney:
" - Complete the work of commit dd56af42bd (rcu: Eliminate deadlock
between CPU hotplug and expedited grace periods), which was
intended to allow synchronize_sched_expedited() to be safely
used when holding locks acquired by CPU-hotplug notifiers.
This commit makes the put_online_cpus() avoid the deadlock
instead of just handling the get_online_cpus().
- Complete the work of commit 35ce7f29a4 (rcu: Create rcuo
kthreads only for onlined CPUs), which was intended to allow
RCU to avoid allocating unneeded kthreads on systems where the
firmware says that there are more CPUs than are really present.
This commit makes rcu_barrier() aware of the mismatch, so that
it doesn't hang waiting for non-existent CPUs. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Found this in the message log on a s390 system:
BUG kmalloc-192 (Not tainted): Poison overwritten
Disabling lock debugging due to kernel taint
INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
__slab_alloc.isra.47.constprop.56+0x5f6/0x658
kmem_cache_alloc_trace+0x106/0x408
call_usermodehelper_setup+0x70/0x128
call_usermodehelper+0x62/0x90
cgroup_release_agent+0x178/0x1c0
process_one_work+0x36e/0x680
worker_thread+0x2f0/0x4f8
kthread+0x10a/0x120
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
__slab_free+0x94/0x560
kfree+0x364/0x3e0
call_usermodehelper_exec+0x110/0x1b8
cgroup_release_agent+0x178/0x1c0
process_one_work+0x36e/0x680
worker_thread+0x2f0/0x4f8
kthread+0x10a/0x120
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
There is a use-after-free bug on the subprocess_info structure allocated
by the user mode helper. If do_execve() returns with an error,
____call_usermodehelper() stores the error code to sub_info->retval, but
sub_info can already have been freed.
Regarding UMH_NO_WAIT, the sub_info structure can be freed by
__call_usermodehelper() before the worker thread returns from
do_execve(), allowing memory corruption when do_execve() failed after
exec_mmap() is called.
Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
call_usermodehelper_exec() to continue which then frees sub_info.
To fix this race the code needs to make sure that the call to
call_usermodehelper_freeinfo() is always done after the last store to
sub_info->retval.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following up on the ARM testing of gcov, it turns out that gcov works
fine on ARM64 as well. The only change needed is adding ARM64 to the
Kconfig depends.
Tested with qemu and mach-virt.
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
contains checks for the case where CPUs are brought online out of
order, re-wiring the rcuo leader-follower relationships as needed.
Unfortunately, this rewiring was broken. This apparently went undetected
due to the tendency of systems to bring CPUs online in order. This commit
nevertheless fixes the rewiring.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads. This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after
TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU
and uses PREEMPT_RCU config option in its place.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Rename CONFIG_RCU_BOOST_PRIO to CONFIG_RCU_KTHREAD_PRIO and use this
value for both the per-CPU kthreads (rcuc/N) and the rcu boosting
threads (rcub/n).
Also, create the module_parameter rcutree.kthread_prio to be used on
the kernel command line at boot to set a new value (rcutree.kthread_prio=N).
Signed-off-by: Clark Williams <clark.williams@gmail.com>
[ paulmck: Ported to rcu/dev, applied Paul Bolle and Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
__cleanup_sighand() frees sighand without an RCU grace period. This is
correct, but it looks "obviously buggy" and constantly confuses
readers. Add comments to explain how this works.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
kill_pid_info() can potentially loop indefinitely if tasks are created
and deleted sufficiently quickly, and if this happens, this function
will remain in a single RCU read-side critical section indefinitely.
This commit therefore exits the RCU read-side critical section on each
pass through the loop. Because a race must happen to retry the loop,
this should have no performance impact in the common case.
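The shape of such a retry loop can be sketched in user space as
follows. The demo_rcu_* functions are counters standing in for the real
RCU primitives, and demo_try_send() is a hypothetical delivery attempt
that fails with -ESRCH a configurable number of times; none of this is
the actual kernel code.

```c
#include <errno.h>

static int demo_rcu_depth;	/* models read-side nesting depth */
static int demo_rcu_sections;	/* counts completed critical sections */
static int demo_races_left;	/* remaining simulated lookup races */

static void demo_rcu_read_lock(void)
{
	demo_rcu_depth++;
}

static void demo_rcu_read_unlock(void)
{
	demo_rcu_depth--;
	demo_rcu_sections++;
}

/* Hypothetical delivery attempt: -ESRCH while races remain, else 0. */
static int demo_try_send(void)
{
	if (demo_races_left > 0) {
		demo_races_left--;
		return -ESRCH;
	}
	return 0;
}

/* Retry loop that leaves the read-side critical section on every pass. */
static int demo_kill_retry(void)
{
	for (;;) {
		int error;

		demo_rcu_read_lock();
		error = demo_try_send();
		demo_rcu_read_unlock();
		if (error != -ESRCH)
			return error;
		/* Raced with task teardown: retry in a fresh section. */
	}
}
```

Each pass is its own short critical section, so a grace period can end
between retries no matter how many times the race repeats.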
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
During the 3.18 merge period additional __get_cpu_var uses were
added. The patch converts these to this_cpu_ptr().
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
ktime_get_real_seconds() is the replacement function for get_seconds(),
returning the seconds portion of CLOCK_REALTIME in a time64_t. On
64-bit the function is equivalent to get_seconds(), but on 32-bit it
protects the readout with the timekeeper sequence count. This is
required because 32-bit machines cannot access the 64-bit tk->xtime_sec
variable atomically.
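For readers unfamiliar with sequence counts, the following is a
simplified user-space model of the read-retry pattern, assuming a
single writer. struct demo_timekeeper and the demo_* functions are
hypothetical and use C11 fences; the kernel's seqcount API is not
reproduced here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical single-writer sequence counter guarding a 64-bit value. */
struct demo_timekeeper {
	volatile unsigned int seq;	/* odd while an update is in progress */
	volatile int64_t xtime_sec;
};

/* Writer: bump the count before and after the update. */
static void demo_write_seconds(struct demo_timekeeper *tk, int64_t sec)
{
	tk->seq++;				/* odd: readers must retry */
	atomic_thread_fence(memory_order_seq_cst);
	tk->xtime_sec = sec;
	atomic_thread_fence(memory_order_seq_cst);
	tk->seq++;				/* even: value is stable again */
}

/* Reader: retry until the value was read with no update in between. */
static int64_t demo_get_real_seconds(const struct demo_timekeeper *tk)
{
	unsigned int seq;
	int64_t sec;

	do {
		/* Wait for an even count, i.e. no update in progress. */
		while ((seq = tk->seq) & 1)
			;
		atomic_thread_fence(memory_order_seq_cst);
		sec = tk->xtime_sec;
		atomic_thread_fence(memory_order_seq_cst);
	} while (tk->seq != seq);

	return sec;
}
```

On a machine that can load the 64-bit value in one access, the loop is
unnecessary, which is why the 64-bit case can return the field directly.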
[tglx: Massaged changelog and added docbook comment ]
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/7adcfaa8962b8ad58785d9a2456c3f77d93c0ffb.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is the counterpart to get_seconds() based on CLOCK_MONOTONIC. The
use case for this interface is kernel-internal coarse-grained
timestamps which require neither the nanoseconds fraction of the
current time nor the CLOCK_REALTIME properties. Such timestamps can
currently only be retrieved by calling ktime_get_ts64() and using the
tv_sec field of the returned timespec64. That's inefficient as it
involves a read of the clocksource and math operations, and must be
protected by the timekeeper sequence counter.
To avoid the sequence counter protection we restrict the return value
to unsigned 32-bit on 32-bit machines. This covers ~136 years of uptime
and therefore an overflow is not expected to hit anytime soon.
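The ~136 year figure can be checked with a few lines of arithmetic. The
helper below is purely illustrative (the name is hypothetical) and
assumes an average year of 365.25 days.

```c
#include <stdint.h>

/* Seconds in an average year of 365.25 days. */
#define DEMO_SECONDS_PER_YEAR (365.25 * 24 * 60 * 60)

/* Number of years covered by an unsigned 32-bit count of seconds. */
static double demo_u32_seconds_in_years(void)
{
	return ((double)UINT32_MAX + 1.0) / DEMO_SECONDS_PER_YEAR;
}
```

2^32 seconds divided by 31557600 seconds per year comes to just over
136 years.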
To avoid math in the function we calculate the current seconds portion
of CLOCK_MONOTONIC when the timekeeper gets updated in
tk_update_ktime_data() similar to the CLOCK_REALTIME counterpart
xtime_sec.
[ tglx: Massaged changelog, simplified and commented the update
function, added docbook comment ]
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/da0b63f4bdf3478909f92becb35861197da3a905.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, synchronize_sched_expedited() sends IPIs to all online CPUs,
even those that are idle or executing in nohz_full= userspace. Because
idle CPUs and nohz_full= userspace CPUs are in extended quiescent states,
there is no need to IPI them in the first place. This commit therefore
avoids IPIing CPUs that are already in extended quiescent states.
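Conceptually, the set of CPUs that still need an IPI is the online set
minus those in an extended quiescent state. The sketch below models
this with plain 64-bit masks; the function name and the mask
representation are assumptions, and how the kernel detects the
quiescent state is not modeled.

```c
#include <stdint.h>

/*
 * Given a mask of online CPUs and a mask of CPUs currently in an
 * extended quiescent state (idle or nohz_full userspace), return the
 * CPUs that still need an IPI.
 */
static uint64_t demo_cpus_to_ipi(uint64_t online_mask, uint64_t eqs_mask)
{
	return online_mask & ~eqs_mask;
}
```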
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There are some RCU_BOOST-specific per-CPU variable declarations that
are needlessly defined under #ifdef in kernel/rcu/tree.c. This commit
therefore moves these declarations into a pre-existing #ifdef in
kernel/rcu/tree_plugin.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CONFIG_RCU_CPU_STALL_VERBOSE Kconfig parameter causes preemptible
RCU's CPU stall warnings to dump out any preempted tasks that are blocking
the current RCU grace period. This information is useful, and the default
has been CONFIG_RCU_CPU_STALL_VERBOSE=y for some years. It is therefore
time for this commit to remove this Kconfig parameter, so that future
kernel builds will always act as if CONFIG_RCU_CPU_STALL_VERBOSE=y.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Merge tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace trampoline accounting fixes from Steven Rostedt:
"Adding the new code for 3.19, I discovered a couple of minor bugs with
the accounting of the ftrace_ops trampoline logic.
One was that the old hash was not updated before calling the modify
code for an ftrace_ops. The second bug was what let the first bug go
unnoticed, as the update would check the current hash for all
ftrace_ops (where it should only check the old hash for modified
ones). This let things work when only one ftrace_ops was registered
to a function, but could break if more than one was registered
depending on the order of the lookups.
The worst thing that can happen if this bug triggers is that the
ftrace self checks would find an anomaly and shut itself down"
* tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix checking of trampoline ftrace_ops in finding trampoline
ftrace: Set ops->old_hash on modifying what an ops hooks to
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Rather than lying about their
age, these systems lie about the number of CPUs that they have.
Before commit 35ce7f29a4, this could result in huge numbers of useless
rcuo kthreads being created.
It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.
It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked. Even if doing so hangs the system.
Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), it is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().
So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.
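The per-CPU decision described above can be summarized as a small
table. The sketch below encodes it as a function for a no-CBs CPU; the
enum and function names are hypothetical and this is not the code from
the patch.

```c
#include <stdbool.h>

enum demo_barrier_action {
	DEMO_SKIP_CPU,		/* nothing to wait for on this CPU */
	DEMO_POST_CALLBACK,	/* normal case: post and wait */
	DEMO_WARN_AND_POST,	/* complain; rcu_barrier() may hang */
};

/* Decide what rcu_barrier() does for one no-CBs CPU in this model. */
static enum demo_barrier_action
demo_barrier_decide(bool has_rcuo_kthread, bool has_pending_cbs)
{
	if (!has_rcuo_kthread && !has_pending_cbs)
		return DEMO_SKIP_CPU;
	if (!has_rcuo_kthread)
		return DEMO_WARN_AND_POST;
	return DEMO_POST_CALLBACK;
}
```

Only the combination of no rcuo kthread and pending callbacks is
treated as an error, which avoids the boot-time false positives
mentioned earlier.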
Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.
In particular, preemption preserves task_struct::state.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Validate that we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, e.g. mutex usage in a wait loop.
Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.
Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.
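The nesting problem can be modeled in user space with a single state
variable. Everything below is a hypothetical simulation (the demo_*
names do not exist in the kernel); it only shows why the outer wait
fails to block and where the new check would fire.

```c
#include <stdbool.h>

enum demo_task_state { DEMO_TASK_RUNNING, DEMO_TASK_INTERRUPTIBLE };

static enum demo_task_state demo_state = DEMO_TASK_RUNNING;
static int demo_might_sleep_warnings;

/* Models the new check: warn when called in a non-running state. */
static void demo_might_sleep(void)
{
	if (demo_state != DEMO_TASK_RUNNING)
		demo_might_sleep_warnings++;
}

/*
 * Models a blocking primitive such as mutex_lock(): it may sleep, and
 * the task is back in the running state when it returns.
 */
static void demo_mutex_lock(void)
{
	demo_might_sleep();
	demo_state = DEMO_TASK_RUNNING;
}

/* Models schedule(): the task only blocks if its state is not running. */
static bool demo_schedule_blocks(void)
{
	bool blocks = demo_state != DEMO_TASK_RUNNING;

	demo_state = DEMO_TASK_RUNNING;
	return blocks;
}

/* Buggy wait-loop body: a blocking call between set-state and schedule(). */
static bool demo_buggy_wait_iteration(void)
{
	demo_state = DEMO_TASK_INTERRUPTIBLE;	/* outer wait prepares to sleep */
	demo_mutex_lock();			/* inner primitive resets the state */
	return demo_schedule_blocks();		/* false: the outer wait spins */
}
```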
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a genuine bug in add_unformed_module(): we cannot use blocking
primitives inside a wait loop.
So rewrite the wait_event_interruptible() usage to use the fresh
wait_woken() stuff.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20140924082242.458562904@infradead.org
[ So this is probably complex to backport and the race wasn't reported AFAIK,
so not marked for -stable. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
smp_hotplug_thread::{setup,unpark} functions can sleep too, so be
consistent and do the same for all callbacks.
__might_sleep+0x74/0x80
kmem_cache_alloc_trace+0x4e/0x1c0
perf_event_alloc+0x55/0x450
perf_event_create_kernel_counter+0x2f/0x100
watchdog_nmi_enable+0x8d/0x160
watchdog_enable+0x45/0x90
smpboot_thread_fn+0xec/0x2b0
kthread+0xe4/0x100
ret_from_fork+0x7c/0xb0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.392279328@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
up calling potential sleeps before we reset it.
Not strictly a bug since we're guaranteed to exit the loop and not
call schedule(); put in annotations to quiet might_sleep().
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270
Call Trace:
[<ffffffff81694991>] dump_stack+0x4e/0x7a
[<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff810bca6e>] __might_sleep+0x7e/0x90
[<ffffffff811a1c15>] might_fault+0x55/0xb0
[<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
[<ffffffff8109a804>] do_wait+0x104/0x270
[<ffffffff8109b837>] SyS_wait4+0x77/0x100
[<ffffffff8169d692>] system_call_fastpath+0x16/0x1b
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: umgwanakikbuti@gmail.com
Cc: ilya.dryomov@inktank.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a few places that call blocking primitives from wait loops.
Provide infrastructure to support this without the typical
task_struct::state collision.
We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.
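A user-space model of the idea, assuming a per-waiter flag word. The
demo_* names and the flag value are hypothetical, and actual sleeping
is replaced by a boolean result; this is a sketch of the mechanism, not
the kernel's wait queue code.

```c
#include <stdbool.h>

#define DEMO_WQ_FLAG_WOKEN 0x02	/* hypothetical flag bit */

enum demo_wq_state { DEMO_WQ_RUNNING, DEMO_WQ_INTERRUPTIBLE };

struct demo_wait_entry {
	unsigned int flags;
};

static enum demo_wq_state demo_wq_task_state = DEMO_WQ_RUNNING;

/* Waker side: record the wakeup in the wait entry, then wake the task. */
static void demo_woken_wake(struct demo_wait_entry *w)
{
	w->flags |= DEMO_WQ_FLAG_WOKEN;
	demo_wq_task_state = DEMO_WQ_RUNNING;
}

/*
 * Waiter side: returns true if the caller would actually sleep. The
 * decision depends on the recorded flag, not on the task state, so a
 * blocking primitive that clobbered the state cannot lose the wakeup.
 */
static bool demo_wait_woken(struct demo_wait_entry *w)
{
	bool would_sleep;

	demo_wq_task_state = DEMO_WQ_INTERRUPTIBLE;
	would_sleep = !(w->flags & DEMO_WQ_FLAG_WOKEN);
	demo_wq_task_state = DEMO_WQ_RUNNING;
	w->flags &= ~DEMO_WQ_FLAG_WOKEN;	/* consume the wakeup */
	return would_sleep;
}
```

Because the wakeup lives in the wait entry, code running between the
condition check and the wait is free to use the task state for its own
blocking.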
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.
There are a few occasions where it's 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING;
typically such cases are right before we set TASK_RUNNING anyhow.
Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy reported that the current state of event_idx is rather confused.
So remove all but the x86_pmu implementation and change the default to
return 0 (the safe option).
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Cody P Schafer <dev@codyps.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Himangi Saraogi <himangi774@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sukadev@linux.vnet.ibm.com <sukadev@linux.vnet.ibm.com>
Cc: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux390@de.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>