Commit Graph

20305 Commits

Author SHA1 Message Date
Thomas Gleixner b484403b9a sched: debug: Remove the cfs bandwidth timer_active printout
The struct member is gone.

Reported-by: fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-04-23 13:58:09 +02:00
Linus Torvalds 27cf3a16b2 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Seven audit patches for v4.1, all bug fixes.

  The largest, and perhaps most significant commit helps resolve some
  memory pressure issues related to the inode cache and audit, there are
  also a few small commits which help resolve some timing issues with
  the audit log queue, and the rest fall into the always popular "code
  clean-up" category.

  In general, nothing really substantial, just a nice set of maintenance
  patches"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: Remove condition which always evaluates to false
  audit: reduce mmap_sem hold for mm->exe_file
  audit: consolidate handling of mm->exe_file
  audit: code clean up
  audit: don't reset working wait time accidentally with auditd
  audit: don't lose set wait time on first successful call to audit_log_start()
  audit: move the tree pruning to a dedicated thread
2015-04-22 14:49:23 -07:00
kbuild test robot 9183034879 perf: perf_mux_hrtimer_cancel() can be static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: kbuild-all@01.org
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150422200000.GA122603@lkp-sb04
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 22:06:34 +02:00
Linus Torvalds 4f2112351b This adds three fixes for the tracing code.
The first is a bug when ftrace_dump_on_oops is triggered in atomic context
 and function graph tracer is the tracer that is being reported.
 
 The second fix is bad parsing of the trace_events from the kernel
 command line, where it would ignore specific events if the system
 name is used with defining the event(it enables all events within the
 system).
 
 The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was missing
 to see if the ptr was incremented to the end of the string, but the loop
 increments it again and can miss the nul delimiter to stop processing.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVN7g7AAoJEEjnJuOKh9ldLCkIAJj6wsj59wR8aIxtxnD5bJZZ
 Y9dKkas6BOaUCGp0MVBCkpm3uMgh0uPO10jOuthgc3LgQy0piwKJrIzGkI5gcRuA
 JvZw2X08/jCSNu8BHydes2A0XMkZobMuWFAeWz3CSzNBfbI3sDDqgnRQ9eyMM66R
 +sPV7ZELQLK/ZFs93gFoaZe/OKGYJcamhMtG2v7p9h1qBJZUDakRFo478bAwu5SB
 Kh6xpXA76WnmkeQekAAsWGdqIQzoNy3IoVePmdhVpZvoiKLUrf1JWVYFoNkO39Ik
 kC1gqJro0EmHQFOo5rt8yQmqXjvSU0sS/sAW3HZ0Szl5OtiUNnag+Wt3ku7lr8U=
 =VORa
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This adds three fixes for the tracing code.

  The first is a bug when ftrace_dump_on_oops is triggered in atomic
  context and function graph tracer is the tracer that is being
  reported.

  The second fix is bad parsing of the trace_events from the kernel
  command line, where it would ignore specific events if the system name
  is used with defining the event(it enables all events within the
  system).

  The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was
  missing to see if the ptr was incremented to the end of the string,
  but the loop increments it again and can miss the nul delimiter to
  stop processing"

* tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix possible out of bounds memory access when parsing enums
  tracing: Fix incorrect enabling of trace events by boot cmdline
  tracing: Handle ftrace_dump() atomic context in graph_trace_open()
2015-04-22 11:27:36 -07:00
Linus Torvalds 15ce2658dd Quentin opened a can of worms by adding extable entry checking to modpost,
but most architectures seem fixed now.  Thanks to all involved.
 
 Last minute rebase because I noticed a "[PATCH]" had snuck into a commit
 message somehow.
 
 Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVN1X3AAoJENkgDmzRrbjxmBEP/1Kv1wCrUGg07TfqNTJgr5Pe
 R81C+8P2wSEsA94lBhLNbn2DsN7uWsygcSXZMQ+5CrmbKAG45kRlMUzSlg6IdGXo
 usA9rOlH4hR+Dqn3zyx543xMJyp0a0LoSCLWGMmr/U71AXNp5kehvbIv64gJHrwg
 4CALI9qCwHH4FXF7WBYdLknlfpmD7KzCjIVckkfUfwf2A8mZXKdqBFQ+MPqbChEm
 vpFWkgp1oJgK4OcvvBeiNT2PwFZy1vmJtQWc2lieJ6HCjbqvwgM0g3EYTIQmccIm
 O40mEFMNkUuliMtrCZgOJBP2bmWy79JydiUq5Mu5Cb4bL6jGl2r6Z+w+iQ7Sggv9
 DTqTrk0/Cn02krJXDXAHVOXXgTqggf1xev6wr+3dbZn5sytQ8ddImuuVai8KALoC
 aEXAsnAK+3GnkAn9qqKVFhnrnCcI69I6srUAn5DKVWu36+Ye05FsZboIFYcxcK1F
 NZ+31xA17xA+yoo62ly/Auz7DXMa2QZfJcoQLMx2fMgL/+wMqzfGbcrZ5ZaVIj4w
 ePig1SbozAweAAfX7+W+QCQ3wqNN/nJgMO8vPldlcq7eaXH0/Dqu72cC2d82dK+H
 xl/1pEj83W0AkUkfNS8j3Z4zSzg31g8bK5SmPPRID00mx8roCfsP/Cfx5hRiO8U2
 ECtbKDxSyUWwF9vnOLVR
 =EYrW
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Quentin opened a can of worms by adding extable entry checking to
  modpost, but most architectures seem fixed now.  Thanks to all
  involved.

  Last minute rebase because I noticed a "[PATCH]" had snuck into a
  commit message somehow"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost: don't emit section mismatch warnings for compiler optimizations
  modpost: expand pattern matching to support substring matches
  modpost: do not try to match the SHT_NUL section.
  modpost: fix extable entry size calculation.
  modpost: fix inverted logic in is_extable_fault_address().
  modpost: handle -ffunction-sections
  modpost: Whitelist .text.fixup and .exception.text
  params: handle quotes properly for values not of form foo="bar".
  modpost: document the use of struct section_check.
  modpost: handle relocations mismatch in __ex_table.
  scripts: add check_extable.sh script.
  modpost: mismatch_handler: retrieve tosym information only when needed.
  modpost: factorize symbol pretty print in get_pretty_name().
  modpost: add handler function pointer to sectioncheck.
  modpost: add .sched.text and .kprobes.text to the TEXT_SECTIONS list.
  modpost: add strict white-listing when referencing sections.
  module: do not print allocation-fail warning on bogus user buffer size
  kernel/module.c: fix typos in message about unused symbols
2015-04-22 09:49:24 -07:00
Peter Zijlstra 272325c482 perf: Fix mux_interval hrtimer wreckage
Thomas stumbled over the hrtimer_forward_now() in
perf_event_mux_interval_ms_store() and noticed its broken-ness.

You cannot just change the expiry time of an active timer, it will
destroy the red-black tree order and cause havoc.

Change it to (re)start the timer instead, (re)starting a timer will
dequeue and enqueue a timer and therefore preserve rb-tree order.

Since we cannot enqueue remotely, wrap the thing in
cpu_function_call(), this however mandates that we restrict ourselves
to online cpus. Also serialize the entire setting so we don't get
multiple concurrent threads trying to update to different values.

Also fix a problem in perf_mux_hrtimer_restart(), checking against
hrtimer_active() can actually loose us the timer when timer->state ==
HRTIMER_STATE_CALLBACK and the callback has already decided NORESTART.

Furthermore it doesn't make any sense to test
hrtimer_callback_running() when we already tested hrtimer_active(),
but with the above change, we explicitly must call it when
callback_running.

Lastly, rename a few functions:

  s/perf_cpu_hrtimer_/perf_mux_hrtimer_/ -- because I could not find
                                            the mux timer function

  s/\<hr\>/timer/ -- because that's the normal way of calling things.

Fixes: 62b8563979 ("perf: Add sysfs entry to adjust multiplexing interval per PMU")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150415095011.863052571@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:12:22 +02:00
Peter Zijlstra 77a4d1a1b9 sched: Cleanup bandwidth timers
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().

The more I look at that code the more I'm convinced its crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if its still enqueued.

Removing that, removes a big part of the problem, no more ugly cancel
loop to get stuck in.

So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.

Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.

Update the rt bandwidth timer to match.

This effectively reverts: 09dc4ab039 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").

Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:53 +02:00
Peter Zijlstra 5de2755c8c hrtimer: Allow concurrent hrtimer_start() for self restarting timers
Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.

If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.

Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.

NOTE: in that concurrent scenario its entirely common for both sites
to want to modify the hrtimer, since hrtimers don't provide
serialization themselves be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.

To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.

Fixes: 2d44ae4d71 ("hrtimer: clean up cpu->base locking tricks")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415113105.GT5029@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner 2ad5d3272d timer: Put usleep_range into the __sched section
do_usleep_range() and schedule_hrtimeout_range() are __sched as
well. So it makes no sense to have the exported function in a
different section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.833709502@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner 6deba083e1 timer: Remove pointless return value of do_usleep_range()
The only user ignores it anyway and rightfully so.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.756060258@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner 19d9f4225d hrtimer: Avoid locking in hrtimer_cancel() if timer not active
We can do a lockless check for hrtimer_active before actually taking
the lock in hrtimer[_try_to]_cancel. This is useful for hotpath users
like nanosleep as they avoid the lock dance when the timer has
expired.

This is safe because active is true when the timer is enqueued or the
callback is running. Taking the hrtimer base lock does not protect
against concurrent hrtimer_start calls, the callsite has to do the
proper serialization itself.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.580273114@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner 61699e1307 hrtimer: Remove hrtimer_start() return value
No user was ever interested whether the timer was active or not when
it was started. All abusers of the return value are gone, so get rid
of it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.483556394@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner b8a62f1ff0 tick: broadcast-hrtimer: Remove overly clever return value abuse
The assignment of bc_moved in the conditional construct relies on the
fact that in the case of hrtimer_start() invocation the return value
is always 0. It took me a while to understand it.

We want to get rid of the hrtimer_start() return value. Open code the
logic which makes it readable as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.404751457@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner b193217e6d alarmtimer: Get rid of unused return value
We want to get rid of the hrtimer_start() return value and the alarm
timer return value is nowhere used. Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20150414203503.243910615@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner ccdd92c17e rtmutex: Remove bogus hrtimer_active() check
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.081830481@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner 2e4b0d3fe8 futex: Remove bogus hrtimer_active() check
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.985825453@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner 3f7b349ac1 hrtimer: Remove bogus hrtimer_active() check
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.907149271@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner 02a171af1a hrtimer: Make hrtimer_start() a inline wrapper
No point for an extra export just to set the extra argument of
hrtimer_start_range_ns() to 0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.808544539@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner 58f1f803f1 hrtimer: Get rid of __hrtimer_start_range_ns()
No more callers. Remove the leftovers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.707871492@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner cc9684d3c1 sched: deadline: Use hrtimer_start()
hrtimer_start() does not longer defer already expired timers to the
softirq. Get rid of the __hrtimer_start_range_ns() invocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.627353666@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner 4961b6e118 sched: core: Use hrtimer_start[_expires]()
hrtimer_start() now enforces a timer interrupt when an already expired
timer is enqueued.

Get rid of the __hrtimer_start_range_ns() invocations and the loops
around it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.531131739@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner 3497d206c4 perf: core: Use hrtimer_start()
hrtimer_start() does not longer defer already expired timers to the
softirq. Get rid of the __hrtimer_start_range_ns() invocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.452104213@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:51 +02:00
Thomas Gleixner c1ad348b45 tick: Nohz: Rework next timer evaluation
The evaluation of the next timer in the nohz code is based on jiffies
while all the tick internals are nano seconds based. We have also to
convert hrtimer nanoseconds to jiffies in the !highres case. That's
just wrong and introduces interesting corner cases.

Turn it around and convert the next timer wheel timer expiry and the
rcu event to clock monotonic and base all calculations on
nanoseconds. That identifies the case where no timer is pending
clearly with an absolute expiry value of KTIME_MAX.

Makes the code more readable and gets rid of the jiffies magic in the
nohz code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner 157d29e101 tick: Sched: Restructure code
Get rid of one indentation level. Preparatory patch for a major
rework. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203502.101563235@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner 0ff53d0964 tick: sched: Force tick interrupt and get rid of softirq magic
We already got rid of the hrtimer reprogramming loops and hoops as
hrtimer now enforces an interrupt if the enqueued time is in the past.

Do the same for the nohz non highres mode. That gets rid of the need
to raise the softirq which only serves the purpose of getting the
machine out of the inner idle loop.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203502.023464878@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner afc08b15cc tick: sched: Remove hrtimer_active() checks
hrtimer_start() enforces a timer interrupt if the timer is already
expired. Get rid of the checks and the forward loop.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203501.943658239@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner c6eb3f70d4 hrtimer: Get rid of hrtimer softirq
hrtimer softirq is a leftover from the initial implementation and
serves only the purpose to handle the enqueueing of already expired
timers in the high resolution timer mode. We discussed whether we
change the return value and force all start sites to handle that the
timer is already expired, but that would be a Herculean task and I'm
not sure whether its a good idea to enforce that handling on
everyone.

A simpler solution is to enforce a timer interrupt instead of raising
and scheduling a softirq. Just use the existing infrastructure to do
so and remove all the softirq leftovers.

The HRTIMER softirq enum is now unused, but kept around because trace
parsers rely on the existing numbering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.840834708@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner 895bdfa793 hrtimer: Keep pointer to first timer and simplify __remove_hrtimer()
__remove_hrtimer() needs to evaluate the expiry time to figure out
whether the timer which is removed is eventually the first expiring
timer on the cpu. Keep a pointer to it, which is lazily updated, so we
can avoid the evaluation dance and retrieve the information from there.

Generates slightly better code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.752838019@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner b97f44c9b6 hrtimer: Make use of timerqueue_add/del return values
Use the return value instead of reevaluating the information.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.658152945@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner 34aee88a02 hrtimer: Use cpu_base->active_base for hotpath iterators
The active_bases field is guaranteed to be in sync with the timerqueue
of the corresponding clock base. So we can use it for iterating over
the clock bases. This allows to break out early if no more active
clock bases are available and avoids touching the cache lines of
inactive clock bases.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.322887675@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:49 +02:00
Thomas Gleixner e19ffe8be2 hrtimer: Use bits for various boolean indicators
No point in wasting 12 byte storage space. Generates better code as well.

Text size reduction:
       x8664 -64, i386 -16, ARM -132, ARM64 -0, power64 -48

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.227955358@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:49 +02:00
Thomas Gleixner 868a3e915f hrtimer: Make offset update smarter
On every tick/hrtimer interrupt we update the offset variables of the
clock bases. That's silly because these offsets change very seldom.

Add a sequence counter to the time keeping code which keeps track of
the offset updates (clock_was_set()). Have a sequence cache in the
hrtimer cpu bases to evaluate whether the offsets must be updated or
not. This allows us later to avoid pointless cacheline pollution.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20150414203501.132820245@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
2015-04-22 17:06:49 +02:00
Thomas Gleixner 21d6d52a1b hrtimer: Get rid of softirq time
The softirq time field in the clock bases is an optimization from the
early days of hrtimers. It provides a coarse "jiffies" like time
mostly for self rearming timers.

But that comes with a price:
    - Larger code size
    - Extra storage space
    - Duplicated functions with really small differences
   
The benefit of this is optimization is marginal for contemporary
systems.

Consolidate everything on the high resolution timer
implementation. This makes further optimizations possible.

Text size reduction:
       x8664 -95, i386 -356, ARM -148, ARM64 -40, power64 -16

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.039977424@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:49 +02:00
Thomas Gleixner a6ffebce7f hrtimer: Make the statistics fields smaller
No point in having usigned long for /proc/timer_list statistics. Make
them unsigned int.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203500.959773467@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:49 +02:00
Thomas Gleixner 056a3cacbc hrtimer: Get rid of hrtimer_get_res()
The resolution is directly accessible now. So its simpler just to fill
in the values of the timespec and be done with it.

Text size reduction (combined with "hrtimer: Get rid of the resolution
field in hrtimer_clock_base"):
       x8664 -61, i386 -221, ARM -60, power64 -48

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203500.879888080@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:49 +02:00
Thomas Gleixner 398ca17fb5 hrtimer: Get rid of the resolution field in hrtimer_clock_base
The field has no value because all clock bases have the same
resolution. The resolution only changes when we switch to high
resolution timer mode. We can evaluate that from a single static
variable as well. In the !HIGHRES case its simply a constant.

Export the variable, so we can simplify the usage sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203500.645454122@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:48 +02:00
Viresh Kumar d9f0acdeef hrtimer: Update active_bases before calling hrtimer_force_reprogram()
'active_bases' indicates which clock-base have active timer. The
intention of this bit field was to avoid evaluating inactive bases. It
was introduced with the introduction of the BOOTTIME and TAI clock
bases, but it was never brought into full use.

We want to use it now, but in __remove_hrtimer() the update happens
after the calling hrtimer_force_reprogram() which has to evaluate all
clock bases for the next expiring timer. So in case the last timer of
a clock base got removed we still see the active bit and therefor
evaluate the clock base for no value. There are further optimizations
possible when active_bases is updated in the right place.

Move the update before the call to hrtimer_force_reprogram()

[ tglx: Massaged changelog ]

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/20150414203500.533438642@linutronix.de
Link: http://lkml.kernel.org/r/c7c8ebcd9ed88bb09d76059c745a1fafb48314e7.1428039899.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:48 +02:00
Thomas Gleixner 91e5a2170e hrtimer: Document hrtimer_forward[_now]() proper
Document the calling context conditions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150413210035.178751779@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:48 +02:00
Thomas Gleixner 51a03393ba timekeeping: Remove stale function prototype
commit 61edec81d2 "timekeeping: Simplify timekeeping_clocktai()"
implemented timekeeping_clocktai() as an inline function, but left the
old extern prototype in the header file. Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 12:03:39 +02:00
Joe Perches 7de4e74430 timer_list: Reduce SEQ_printf footprint
This macro can be converted to a static function to reduce
object size.

(x86-64 defconfig)
$ size kernel/time/timer_list.o*
   text	   data	    bss	    dec	    hex	filename
   6583	      8	      0	   6591	   19bf	kernel/time/timer_list.o.old
   4647	      8	      0	   4655	   122f	kernel/time/timer_list.o.new

Signed-off-by: Joe Perches <joe@perches.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1429295958.2850.104.camel@perches.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 12:03:39 +02:00
Linus Torvalds 1fc149933f Char/Misc driver patches for 4.1-rc1
Here's the big char/misc driver patchset for 4.1-rc1.
 
 Lots of different driver subsystem updates here, nothing major, full
 details are in the shortlog below.
 
 All of this has been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlU2IMEACgkQMUfUDdst+yloDQCfbyIRL23WVAn9ckQse/y8gbjB
 OT4AoKTJbwndDP9Kb/lrj2tjd9QjNVrC
 =xhen
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here's the big char/misc driver patchset for 4.1-rc1.

  Lots of different driver subsystem updates here, nothing major, full
  details are in the shortlog.

  All of this has been in linux-next for a while"

* tag 'char-misc-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (133 commits)
  mei: trace: remove unused TRACE_SYSTEM_STRING
  DTS: ARM: OMAP3-N900: Add lis3lv02d support
  Documentation: DT: lis302: update wakeup binding
  lis3lv02d: DT: add wakeup unit 2 and wakeup threshold
  lis3lv02d: DT: use s32 to support negative values
  Drivers: hv: hv_balloon: correctly handle num_pages>INT_MAX case
  Drivers: hv: hv_balloon: correctly handle val.freeram<num_pages case
  mei: replace check for connection instead of transitioning
  mei: use mei_cl_is_connected consistently
  mei: fix mei_poll operation
  hv_vmbus: Add gradually increased delay for retries in vmbus_post_msg()
  Drivers: hv: hv_balloon: survive ballooning request with num_pages=0
  Drivers: hv: hv_balloon: eliminate jumps in piecewiese linear floor function
  Drivers: hv: hv_balloon: do not online pages in offline blocks
  hv: remove the per-channel workqueue
  hv: don't schedule new works in vmbus_onoffer()/vmbus_onoffer_rescind()
  hv: run non-blocking message handlers in the dispatch tasklet
  coresight: moving to new "hwtracing" directory
  coresight-tmc: Adding a status interface to sysfs
  coresight: remove the unnecessary configuration coresight-default-sink
  ...
2015-04-21 09:42:58 -07:00
Linus Torvalds 41d5e08ea8 TTY/Serial patches for 4.1-rc1
Here's the big tty/serial driver update for 4.1-rc1.
 
 It was delayed for a bit due to some questions surrounding some of the
 console command line parsing changes that are in here.  There's still
 one tiny regression for people who were previously putting multiple
 console command lines and expecting them all to be ignored for some odd
 reason, but Peter is working on fixing that.  If not, I'll send a revert
 for the offending patch, but I have faith that Peter can address it.
 
 Other than the console work here, there's the usual serial driver
 updates and changes, and a buch of 8250 reworks to try to make that
 driver easier to maintain over time, and have it support more devices in
 the future.
 
 All of these have been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlU2IcUACgkQMUfUDdst+ylFqACcC8LPhFEZg9aHn0hNUoqGK3rE
 5dUAnR4b8r/NYqjVoE9FJZgZfB/TqVi1
 =lyN/
 -----END PGP SIGNATURE-----

Merge tag 'tty-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial updates from Greg KH:
 "Here's the big tty/serial driver update for 4.1-rc1.

  It was delayed for a bit due to some questions surrounding some of the
  console command line parsing changes that are in here.  There's still
  one tiny regression for people who were previously putting multiple
  console command lines and expecting them all to be ignored for some
  odd reason, but Peter is working on fixing that.  If not, I'll send a
  revert for the offending patch, but I have faith that Peter can
  address it.

  Other than the console work here, there's the usual serial driver
  updates and changes, and a buch of 8250 reworks to try to make that
  driver easier to maintain over time, and have it support more devices
  in the future.

  All of these have been in linux-next for a while"

* tag 'tty-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (119 commits)
  n_gsm: Drop unneeded cast on netdev_priv
  sc16is7xx: expose RTS inversion in RS-485 mode
  serial: 8250_pci: port failed after wakeup from S3
  earlycon: 8250: Document kernel command line options
  earlycon: 8250: Fix command line regression
  earlycon: Fix __earlycon_table stride
  tty: clean up the tty time logic a bit
  serial: 8250_dw: only get the clock rate in one place
  serial: 8250_dw: remove useless ACPI ID check
  dmaengine: hsu: move memory allocation to GFP_NOWAIT
  dmaengine: hsu: remove redundant pieces of code
  serial: 8250_pci: add Intel Tangier support
  dmaengine: hsu: add Intel Tangier PCI ID
  serial: 8250_pci: replace switch-case by formula for Intel MID
  serial: 8250_pci: replace switch-case by formula
  tty: cpm_uart: replace CONFIG_8xx by CONFIG_CPM1
  serial: jsm: some off by one bugs
  serial: xuartps: Fix check in console_setup().
  serial: xuartps: Get rid of register access macros.
  serial: xuartps: Fix iobase use.
  ...
2015-04-21 09:33:10 -07:00
Linus Torvalds 5224b9613b smp: Fix error case handling in smp_call_function_*()
Commit 8053871d0f ("smp: Fix smp_call_function_single_async()
locking") fixed the locking for the asynchronous smp-call case, but in
the process of moving the lock handling around, one of the error cases
ended up not unlocking the call data at all.

This went unnoticed on x86, because this is a "caller is buggy" case,
where the caller is trying to call a non-existent CPU.  But apparently
ARM does that (at least under qemu-arm).  Bindly doing cross-cpu calls
to random CPU's that aren't even online seems a bit fishy, but the error
handling was clearly not correct.

Simply add the missing "csd_unlock()" to the error path.

Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Analyzed-by: Rabin Vincent <rabin@rab.in>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-19 13:19:23 -07:00
Linus Torvalds 396c9df223 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Two fixes: an smp-call fix and a lockdep fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Fix smp_call_function_single_async() locking
  lockdep: Make print_lock() robust against concurrent release
2015-04-18 11:23:42 -04:00
Ingo Molnar cb0f3f320d Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull RCU fix from Paul E. McKenney:

 "This series contains a single change that fixes Kconfig asking pointless
  questions."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-18 14:49:41 +02:00
Linus Torvalds 388f997620 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix verifier memory corruption and other bugs in BPF layer, from
    Alexei Starovoitov.

 2) Add a conservative fix for doing BPF properly in the BPF classifier
    of the packet scheduler on ingress.  Also from Alexei.

 3) The SKB scrubber should not clear out the packet MARK and security
    label, from Herbert Xu.

 4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.

 5) Pause handling is not correct in the stmmac driver because it
    doesn't take into consideration the RX and TX fifo sizes.  From
    Vince Bridgers.

 6) Failure path missing unlock in FOU driver, from Wang Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  net: dsa: use DEVICE_ATTR_RW to declare temp1_max
  netns: remove BUG_ONs from net_generic()
  IB/ipoib: Fix ndo_get_iflink
  sfc: Fix memcpy() with const destination compiler warning.
  altera tse: Fix network-delays and -retransmissions after high throughput.
  net: remove unused 'dev' argument from netif_needs_gso()
  act_mirred: Fix bogus header when redirecting from VLAN
  inet_diag: fix access to tcp cc information
  tcp: tcp_get_info() should fetch socket fields once
  net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
  skbuff: Do not scrub skb mark within the same name space
  Revert "net: Reset secmark when scrubbing packet"
  bpf: fix two bugs in verification logic when accessing 'ctx' pointer
  bpf: fix bpf helpers to use skb->mac_header relative offsets
  stmmac: Configure Flow Control to work correctly based on rxfifo size
  stmmac: Enable unicast pause frame detect in GMAC Register 6
  stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
  stmmac: Add defines and documentation for enabling flow control
  stmmac: Add properties for transmit and receive fifo sizes
  stmmac: fix oops on rmmod after assigning ip addr
  ...
2015-04-17 16:31:08 -04:00
Steven Rostedt (Red Hat) 3193899d4d tracing: Fix possible out of bounds memory access when parsing enums
The code that replaces the enum names with the enum values in the
tracepoints' format files could possible miss the end of string nul
character. This was caused by processing things like backslashes, quotes
and other tokens. After processing the tokens, a check for the nul
character needed to be done before continuing the loop, because the loop
incremented the pointer before doing the check, which could bypass the nul
character.

Link: http://lkml.kernel.org/r/552E661D.5060502@oracle.com

Reported-by: Sasha Levin <sasha.levin@oracle.com> # via KASan
Tested-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Fixes: 0c564a538a "tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-17 10:34:43 -04:00
Davidlohr Bueso 11163348a2 oprofile: reduce mmap_sem hold for mm->exe_file
sync_buffer() needs the mmap_sem for two distinct operations, both only
occurring upon user context switch handling:

 1) Dealing with the exe_file.

 2) Adding the dcookie data as we need to lookup the vma that
   backs it. This is done via add_sample() and add_data().

This patch isolates 1), for it will no longer need the mmap_sem for
serialization.  However, for now, make of the more standard
get_mm_exe_file(), requiring only holding the mmap_sem to read the value,
and relying on reference counting to make sure that the exe file won't
dissappear underneath us while doing the get dcookie.

As a consequence, for 2) we move the mmap_sem locking into where we really
need it, in lookup_dcookie().  The benefits are twofold: reduce mmap_sem
hold times, and cleaner code.

[akpm@linux-foundation.org: export get_mm_exe_file for arch/x86/oprofile/oprofile.ko]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Robert Richter <rric@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:11 -04:00
Andrey Ryabinin 9d796e6623 gcov: fix softlockups
gcov profiling if enabled with other heavy compile-time instrumentation
like KASan could trigger following softlockups:

  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
  Modules linked in:
  irq event stamp: 22823276
  hardirqs last  enabled at (22823275): [<ffffffff86e8d10d>] mutex_lock_nested+0x7d9/0x930
  hardirqs last disabled at (22823276): [<ffffffff86e9521d>] apic_timer_interrupt+0x6d/0x80
  softirqs last  enabled at (22823172): [<ffffffff811ed969>] __do_softirq+0x4db/0x729
  softirqs last disabled at (22823167): [<ffffffff811edfcf>] irq_exit+0x7d/0x15b
  CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W       3.19.0-05245-gbb33326-dirty #3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
  task: ffff88006cba8000 ti: ffff88006cbb0000 task.ti: ffff88006cbb0000
  RIP: kasan_mem_to_shadow+0x1e/0x1f
  Call Trace:
    strcmp+0x28/0x70
    get_node_by_name+0x66/0x99
    gcov_event+0x4f/0x69e
    gcov_enable_events+0x54/0x7b
    gcov_fs_init+0xf8/0x134
    do_one_initcall+0x1b2/0x288
    kernel_init_freeable+0x467/0x580
    kernel_init+0x15/0x18b
    ret_from_fork+0x7c/0xb0
  Kernel panic - not syncing: softlockup: hung tasks

Fix this by sticking cond_resched() in gcov_enable_events().

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:08 -04:00
Heinrich Schuchardt 230633d109 kernel/sysctl.c: detect overflows when converting to int
When converting unsigned long to int overflows may occur.  These currently
are not detected when writing to the sysctl file system.

E.g. on a system where int has 32 bits and long has 64 bits
  echo 0x800001234 > /proc/sys/kernel/threads-max
has the same effect as
  echo 0x1234 > /proc/sys/kernel/threads-max

The patch adds the missing check in do_proc_dointvec_conv.

With the patch an overflow will result in an error EINVAL when writing to
the the sysctl file system.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:08 -04:00
Davidlohr Bueso 6e399cd144 prctl: avoid using mmap_sem for exe_file serialization
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
of calling set_mm_exe_file() which requires some form of serialization --
mmap_sem in this case.  For archs that do not have atomic rmw instructions
we still fallback to a spinlock alternative, so this should always be
safe.  As such, we only need the mmap_sem for looking up the backing
vm_file, which can be done sharing the lock.  Naturally, this means we
need to manually deal with both the new and old file reference counting,
and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
probably be deleted in the future anyway.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Konstantin Khlebnikov 90f31d0ea8 mm: rcu-protected get_mm_exe_file()
This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.

[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Heinrich Schuchardt 16db3d3f11 kernel/sysctl.c: threads-max observe limits
Users can change the maximum number of threads by writing to
/proc/sys/kernel/threads-max.

With the patch the value entered is checked against the same limits that
apply when fork_init is called.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Heinrich Schuchardt ac1b398de1 kernel/fork.c: avoid division by zero
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
THREAD_SIZE.

E.g.  architecture hexagon may have page size 1M and thread size 4096.
This would lead to a division by zero in the calculation of max_threads.

With 32-bit calculation there is no solution which delivers valid results
for all possible combinations of the parameters.  The code is only called
once.  Hence a 64-bit calculation can be used as solution.

[akpm@linux-foundation.org: use clamp_t(), per Oleg]
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Heinrich Schuchardt ff691f6e03 kernel/fork.c: new function for max_threads
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
THREAD_SIZE.

E.g.  architecture hexagon may have page size 1M and thread size 4096.
This would lead to a division by zero in the calculation of max_threads.

With this patch the buggy code is moved to a separate function
set_max_threads.  The error is not fixed.

After fixing the problem in a separate patch the new function can be
reused to adjust max_threads after adding or removing memory.

Argument mempages of function fork_init() is removed as totalram_pages is
an exported symbol.

The creation of separate patches for refactoring to a new function and for
fixing the logic was suggested by Ingo Molnar.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Jean Delvare 3ea7f5e25e fork_init: update max_threads comment
The comment explaining what value max_threads is set to is outdated.  The
maximum memory consumption ratio for thread structures was 1/2 until
February 2002, then it was briefly changed to 1/16 before being set to 1/8
which we still use today.  The comment was never updated to reflect that
change, it's about time.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Michal Hocko 35f71bc0a0 fork: report pid reservation failure properly
copy_process will report any failure in alloc_pid as ENOMEM currently
which is misleading because the pid allocation might fail not only when
the memory is short but also when the pid space is consumed already.

The current man page even mentions this case:

: EAGAIN
:
:       A system-imposed limit on the number of threads was encountered.
:       There are a number of limits that may trigger this error: the
:       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
:       limits the number of processes and threads for a real user ID, was
:       reached; the kernel's system-wide limit on the number of processes
:       and threads, /proc/sys/kernel/threads-max, was reached (see
:       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
:       was reached (see proc(5)).

so the current behavior is also incorrect wrt.  documentation.  POSIX man
page also suggest returing EAGAIN when the process count limit is reached.

This patch simply propagates error code from alloc_pid and makes sure we
return -EAGAIN due to reservation failure.  This will make behavior of
fork closer to both our documentation and POSIX.

alloc_pid might alsoo fail when the reaper in the pid namespace is dead
(the namespace basically disallows all new processes) and there is no
good error code which would match documented ones. We have traditionally
returned ENOMEM for this case which is misleading as well but as per
Eric W. Biederman this behavior is documented in man pid_namespaces(7)

: If the "init" process of a PID namespace terminates, the kernel
: terminates all of the processes in the namespace via a SIGKILL signal.
: This behavior reflects the fact that the "init" process is essential for
: the correct operation of a PID namespace.  In this case, a subsequent
: fork(2) into this PID namespace will fail with the error ENOMEM; it is
: not possible to create a new processes in a PID namespace whose "init"
: process has terminated.

and introducing a new error code would be too risky so let's stick to
ENOMEM for this case.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Vladimir Davydov 69828dce7a signal: remove warning about using SI_TKILL in rt_[tg]sigqueueinfo
Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue
a warning on the first attempt of doing it.  We use WARN_ON_ONCE, which is
not informative and, what is worse, taints the kernel, making the trinity
syscall fuzzer complain false-positively from time to time.

It does not look like we need this warning at all, because the behaviour
changed quite a long time ago (2.6.39), and if an application relies on
the old API, it gets EPERM anyway and can issue a warning by itself.

So let us zap the warning in kernel.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Oleg Nesterov 64a4096c5c ptrace: ptrace_detach() can no longer race with SIGKILL
ptrace_detach() re-checks ->ptrace under tasklist lock and calls
release_task() if __ptrace_detach() returns true.  This was needed because
the __TASK_TRACED tracee could be killed/untraced, and it could even pass
exit_notify() before we take tasklist_lock.

But this is no longer possible after 9899d11f65 "ptrace: ensure
arch_ptrace/ptrace_request can never race with SIGKILL".  We can turn
these checks into WARN_ON() and remove release_task().

While at it, document the setting of child->exit_code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Labath <labath@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Oleg Nesterov b72c186999 ptrace: fix race between ptrace_resume() and wait_task_stopped()
ptrace_resume() is called when the tracee is still __TASK_TRACED.  We set
tracee->exit_code and then wake_up_state() changes tracee->state.  If the
tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T)
wrongly looks like another report from tracee.

This confuses debugger, and since wait_task_stopped() clears ->exit_code
the tracee can miss a signal.

Test-case:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/wait.h>
	#include <sys/ptrace.h>
	#include <pthread.h>
	#include <assert.h>

	int pid;

	void *waiter(void *arg)
	{
		int stat;

		for (;;) {
			assert(pid == wait(&stat));
			assert(WIFSTOPPED(stat));
			if (WSTOPSIG(stat) == SIGHUP)
				continue;

			assert(WSTOPSIG(stat) == SIGCONT);
			printf("ERR! extra/wrong report:%x\n", stat);
		}
	}

	int main(void)
	{
		pthread_t thread;

		pid = fork();
		if (!pid) {
			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
			for (;;)
				kill(getpid(), SIGHUP);
		}

		assert(pthread_create(&thread, NULL, waiter, NULL) == 0);

		for (;;)
			ptrace(PTRACE_CONT, pid, 0, SIGCONT);

		return 0;
	}

Note for stable: the bug is very old, but without 9899d11f65 "ptrace:
ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix
should use lock_task_sighand(child).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Pavel Labath <labath@google.com>
Tested-by: Pavel Labath <labath@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:06 -04:00
Linus Torvalds 8053871d0f smp: Fix smp_call_function_single_async() locking
The current smp_function_call code suffers a number of problems, most
notably smp_call_function_single_async() is broken.

The problem is that flush_smp_call_function_queue() does csd_unlock()
_after_ calling csd->func(). This means that a caller cannot properly
synchronize the csd usage as it has to.

Change the code to release the csd before calling ->func() for the
async case, and put a WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK) in
smp_call_function_single_async() to warn us of improper serialization,
because any waiting there can results in deadlocks when called with
IRQs disabled.

Rename the (currently) unused WAIT flag to SYNCHRONOUS and (re)use it
such that we know what to do in flush_smp_call_function_queue().

Rework csd_{,un}lock() to use smp_load_acquire() / smp_store_release()
to avoid some full barriers while more clearly providing lock
semantics.

Finally move the csd maintenance out of generic_exec_single() into its
callers for clearer code.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Rafael David Tinoco <inaddy@ubuntu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-17 09:57:52 +02:00
Peter Zijlstra d7bc3197b4 lockdep: Make print_lock() robust against concurrent release
During sysrq's show-held-locks command it is possible that
hlock_class() returns NULL for a given lock. The result is then (after
the warning):

	|BUG: unable to handle kernel NULL pointer dereference at 0000001c
	|IP: [<c1088145>] get_usage_chars+0x5/0x100
	|Call Trace:
	| [<c1088263>] print_lock_name+0x23/0x60
	| [<c1576b57>] print_lock+0x5d/0x7e
	| [<c1088314>] lockdep_print_held_locks+0x74/0xe0
	| [<c1088652>] debug_show_all_locks+0x132/0x1b0
	| [<c1315c48>] sysrq_handle_showlocks+0x8/0x10

This *might* happen because the thread on the other CPU drops the lock
after we are looking ->lockdep_depth and ->held_locks points no longer
to a lock that is held.

The fix here is to simply ignore it and continue.

Reported-by: Andreas Messerschmid <andreas@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-17 09:42:14 +02:00
Alexei Starovoitov 725f9dcd58 bpf: fix two bugs in verification logic when accessing 'ctx' pointer
1.
first bug is a silly mistake. It broke tracing examples and prevented
simple bpf programs from loading.

In the following code:
if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
} else if (...) {
  // this part should have been executed when
  // insn->code == BPF_W and insn->imm != 0
}

Obviously it's not doing that. So simple instructions like:
r2 = *(u64 *)(r1 + 8)
will be rejected. Note the comments in the code around these branches
were and still valid and indicate the true intent.

Replace it with:
if (BPF_SIZE(insn->code) != BPF_W)
  continue;

if (insn->imm == 0) {
} else if (...) {
  // now this code will be executed when
  // insn->code == BPF_W and insn->imm != 0
}

2.
second bug is more subtle.
If malicious code is using the same dest register as source register,
the checks designed to prevent the same instruction to be used with different
pointer types will fail to trigger, since we were assigning src_reg_type
when it was already overwritten by check_mem_access().
The fix is trivial. Just move line:
src_reg_type = regs[insn->src_reg].type;
before check_mem_access().
Add new 'access skb fields bad4' test to check this case.

Fixes: 9bac3d6d54 ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:08:49 -04:00
Alexei Starovoitov c3de6317d7 bpf: fix verifier memory corruption
Due to missing bounds check the DAG pass of the BPF verifier can corrupt
the memory which can cause random crashes during program loading:

[8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
[8.451293] IP: [<ffffffff811de33d>] kmem_cache_alloc_trace+0x8d/0x2f0
[8.452329] Oops: 0000 [#1] SMP
[8.452329] Call Trace:
[8.452329]  [<ffffffff8116cc82>] bpf_check+0x852/0x2000
[8.452329]  [<ffffffff8116b7e4>] bpf_prog_load+0x1e4/0x310
[8.452329]  [<ffffffff811b190f>] ? might_fault+0x5f/0xb0
[8.452329]  [<ffffffff8116c206>] SyS_bpf+0x806/0xa30

Fixes: f1bca824da ("bpf: add search pruning optimization to verifier")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 12:06:11 -04:00
Joonsoo Kim 84fce9db4d tracing: Fix incorrect enabling of trace events by boot cmdline
There is a problem that trace events are not properly enabled with
boot cmdline. The problem is that if we pass "trace_event=kmem:mm_page_alloc"
to the boot cmdline, it enables all kmem trace events, and not just
the page_alloc event.

This is caused by the parsing mechanism. When we parse the cmdline, the buffer
contents is modified due to tokenization. And, if we use this buffer
again, we will get the wrong result.

Unfortunately, this buffer is be accessed three times to set trace events
properly at boot time. So, we need to handle this situation.

There is already code handling ",", but we need another for ":".
This patch adds it.

Link: http://lkml.kernel.org/r/1429159484-22977-1-git-send-email-iamjoonsoo.kim@lge.com

Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[ added missing return ret; ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-16 09:44:07 -04:00
Rabin Vincent ef99b88b16 tracing: Handle ftrace_dump() atomic context in graph_trace_open()
graph_trace_open() can be called in atomic context from ftrace_dump().
Use GFP_ATOMIC for the memory allocations when that's the case, in order
to avoid the following splat.

 BUG: sleeping function called from invalid context at mm/slab.c:2849
 in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/0
 Backtrace:
 ..
 [<8004dc94>] (__might_sleep) from [<801371f4>] (kmem_cache_alloc_trace+0x160/0x238)
  r7:87800040 r6:000080d0 r5:810d16e8 r4:000080d0
 [<80137094>] (kmem_cache_alloc_trace) from [<800cbd60>] (graph_trace_open+0x30/0xd0)
  r10:00000100 r9:809171a8 r8:00008e28 r7:810d16f0 r6:00000001 r5:810d16e8
  r4:810d16f0
 [<800cbd30>] (graph_trace_open) from [<800c79c4>] (trace_init_global_iter+0x50/0x9c)
  r8:00008e28 r7:808c853c r6:00000001 r5:810d16e8 r4:810d16f0 r3:800cbd30
 [<800c7974>] (trace_init_global_iter) from [<800c7aa0>] (ftrace_dump+0x90/0x2ec)
  r4:810d2580 r3:00000000
 [<800c7a10>] (ftrace_dump) from [<80414b2c>] (sysrq_ftrace_dump+0x1c/0x20)
  r10:00000100 r9:809171a8 r8:808f6e7c r7:00000001 r6:00000007 r5:0000007a
  r4:808d5394
 [<80414b10>] (sysrq_ftrace_dump) from [<800169b8>] (return_to_handler+0x0/0x18)
 [<80415498>] (__handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18)
  r8:808c8100 r7:808c8444 r6:00000101 r5:00000010 r4:84eb3210
 [<80415668>] (handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18)
 [<8042a760>] (pl011_int) from [<800169b8>] (return_to_handler+0x0/0x18)
  r10:809171bc r9:809171a8 r8:00000001 r7:00000026 r6:808c6000 r5:84f01e60
  r4:8454fe00
 [<8007782c>] (handle_irq_event_percpu) from [<80077b44>] (handle_irq_event+0x4c/0x6c)
  r10:808c7ef0 r9:87283e00 r8:00000001 r7:00000000 r6:8454fe00 r5:84f01e60
  r4:84f01e00
 [<80077af8>] (handle_irq_event) from [<8007aa28>] (handle_fasteoi_irq+0xf0/0x1ac)
  r6:808f52a4 r5:84f01e60 r4:84f01e00 r3:00000000
 [<8007a938>] (handle_fasteoi_irq) from [<80076dc0>] (generic_handle_irq+0x3c/0x4c)
  r6:00000026 r5:00000000 r4:00000026 r3:8007a938
 [<80076d84>] (generic_handle_irq) from [<80077128>] (__handle_domain_irq+0x8c/0xfc)
  r4:808c1e38 r3:0000002e
 [<8007709c>] (__handle_domain_irq) from [<800087b8>] (gic_handle_irq+0x34/0x6c)
  r10:80917748 r9:00000001 r8:88802100 r7:808c7ef0 r6:808c8fb0 r5:00000015
  r4:8880210c r3:808c7ef0
 [<80008784>] (gic_handle_irq) from [<80014044>] (__irq_svc+0x44/0x7c)

Link: http://lkml.kernel.org/r/1428953721-31349-1-git-send-email-rabin@rab.in
Link: http://lkml.kernel.org/r/1428957012-2319-1-git-send-email-rabin@rab.in

Cc: stable@vger.kernel.org # 3.13+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-16 09:32:17 -04:00
Linus Torvalds eea3a00264 Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - the rest of MM

 - various misc bits

 - add ability to run /sbin/reboot at reboot time

 - printk/vsprintf changes

 - fiddle with seq_printf() return value

* akpm: (114 commits)
  parisc: remove use of seq_printf return value
  lru_cache: remove use of seq_printf return value
  tracing: remove use of seq_printf return value
  cgroup: remove use of seq_printf return value
  proc: remove use of seq_printf return value
  s390: remove use of seq_printf return value
  cris fasttimer: remove use of seq_printf return value
  cris: remove use of seq_printf return value
  openrisc: remove use of seq_printf return value
  ARM: plat-pxa: remove use of seq_printf return value
  nios2: cpuinfo: remove use of seq_printf return value
  microblaze: mb: remove use of seq_printf return value
  ipc: remove use of seq_printf return value
  rtc: remove use of seq_printf return value
  power: wakeup: remove use of seq_printf return value
  x86: mtrr: if: remove use of seq_printf return value
  linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
  MAINTAINERS: CREDITS: remove Stefano Brivio from B43
  .mailmap: add Ricardo Ribalda
  CREDITS: add Ricardo Ribalda Delgado
  ...
2015-04-15 16:39:15 -07:00
Joe Perches 962e3707d9 tracing: remove use of seq_printf return value
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03 ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Miscellanea:

o Remove unused return value from trace_lookup_stack

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:25 -07:00
Joe Perches 94ff212d09 cgroup: remove use of seq_printf return value
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03 ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:25 -07:00
Joel Stanley 7a54f46b30 kernel/reboot.c: add orderly_reboot for graceful reboot
The kernel has orderly_poweroff which allows the kernel to initiate a
graceful shutdown of userspace, by running /sbin/poweroff.  This adds
orderly_reboot that will cause userspace to shut itself down by calling
/sbin/reboot.

This will be used for shutdown initiated by a system controller on
platforms that do not use ACPI.

orderly_reboot() should be used when the system wants to allow userspace
to gracefully shut itself down.  For cases where the system may imminently
catch on fire, the existing emergency_restart() provides an immediate
reboot without involving userspace.

Signed-off-by: Joel Stanley <joel@jms.id.au>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:23 -07:00
Aaron Tomlin 972fae6993 kernel/hung_task.c: change hung_task.c to use for_each_process_thread()
In check_hung_uninterruptible_tasks() avoid the use of deprecated
while_each_thread().

The "max_count" logic will prevent a livelock - see commit 0c740d0a
("introduce for_each_thread() to replace the buggy while_each_thread()").
Having said this let's use for_each_process_thread().

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Jakub Sitnicki 96831c0a67 kernel/resource.c: remove deprecated __check_region() and friends
All users of __check_region(), check_region(), and check_mem_region() are
gone.  We got rid of the last user in v4.0-rc1.  Remove them.

bloat-o-meter on x86_64 shows:

add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
function                                     old     new   delta
__kstrtab___check_region                      15       -     -15
__ksymtab___check_region                      16       -     -16
__check_region                                71       -     -71

Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Iulia Manda 2813893f8b kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root.  For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional.  It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build.  The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB.  (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu.  All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00
Eric B Munson 5bbe3547aa mm: allow compaction of unevictable pages
Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration.  The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of this is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion.  The
compaction code today does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state.  This patch introduces a sysctl for controlling compaction
behavior with respect to the unevictable lru.  Users who demand no page
faults after a page is present can set compact_unevictable_allowed to 0
and users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data.  These maps are
created locked and read only.  Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool.  When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
after fragmenting memory.  When the value is set to 1, allocations
succeed.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:17 -07:00
Vladimir Davydov adbe427b92 memcg: zap mem_cgroup_lookup()
mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which
checks that id != 0 before issuing the function call.  Today, there is
no point in this additional check apart from optimization, because there
is no css with id <= 0, so that css_from_id, called by
mem_cgroup_from_id, will return NULL for any id <= 0.

Since mem_cgroup_from_id is only called from mem_cgroup_lookup, let us
zap mem_cgroup_lookup, substituting calls to it with mem_cgroup_from_id
and moving the check if id > 0 to css_from_id.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:16 -07:00
Linus Torvalds fa2e5c073a Merge branch 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc
Pull exec domain removal from Richard Weinberger:
 "This series removes execution domain support from Linux.

  The idea behind exec domains was to support different ABIs.  The
  feature was never complete nor stable.  Let's rip it out and make the
  kernel signal handling code less complicated"

* 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (27 commits)
  arm64: Removed unused variable
  sparc: Fix execution domain removal
  Remove rest of exec domains.
  arch: Remove exec_domain from remaining archs
  arc: Remove signal translation and exec_domain
  xtensa: Remove signal translation and exec_domain
  xtensa: Autogenerate offsets in struct thread_info
  x86: Remove signal translation and exec_domain
  unicore32: Remove signal translation and exec_domain
  um: Remove signal translation and exec_domain
  tile: Remove signal translation and exec_domain
  sparc: Remove signal translation and exec_domain
  sh: Remove signal translation and exec_domain
  s390: Remove signal translation and exec_domain
  mn10300: Remove signal translation and exec_domain
  microblaze: Remove signal translation and exec_domain
  m68k: Remove signal translation and exec_domain
  m32r: Remove signal translation and exec_domain
  m32r: Autogenerate offsets in struct thread_info
  frv: Remove signal translation and exec_domain
  ...
2015-04-15 13:53:55 -07:00
Linus Torvalds fa927894bb Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs update from Al Viro:
 "Now that net-next went in...  Here's the next big chunk - killing
  ->aio_read() and ->aio_write().

  There'll be one more pile today (direct_IO changes and
  generic_write_checks() cleanups/fixes), but I'd prefer to keep that
  one separate"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  ->aio_read and ->aio_write removed
  pcm: another weird API abuse
  infinibad: weird APIs switched to ->write_iter()
  kill do_sync_read/do_sync_write
  fuse: use iov_iter_get_pages() for non-splice path
  fuse: switch to ->read_iter/->write_iter
  switch drivers/char/mem.c to ->read_iter/->write_iter
  make new_sync_{read,write}() static
  coredump: accept any write method
  switch /dev/loop to vfs_iter_write()
  serial2002: switch to __vfs_read/__vfs_write
  ashmem: use __vfs_read()
  export __vfs_read()
  autofs: switch to __vfs_write()
  new helper: __vfs_write()
  switch hugetlbfs to ->read_iter()
  coda: switch to ->read_iter/->write_iter
  ncpfs: switch to ->read_iter/->write_iter
  net/9p: remove (now-)unused helpers
  p9_client_attach(): set fid->uid correctly
  ...
2015-04-15 13:22:56 -07:00
David Howells 7682c91843 VFS: kernel/: d_inode() annotations
relayfs and tracefs are dealing with inodes of their own;
those two act as filesystem drivers

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:55 -04:00
David Howells 3b362157b2 VFS: audit: d_backing_inode() annotations
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:55 -04:00
Linus Torvalds 6c373ca893 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add BQL support to via-rhine, from Tino Reichardt.

 2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading.  From Floria Fainelli.

 3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

 4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

 5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

 6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc.  And use this to clean up the neigh layer, then use it to
    implement MPLS support.  All from Eric Biederman.

 7) Support L3 forwarding offloading in switches, from Scott Feldman.

 8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further.  From Alexander Duyck.

 9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf.  In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

12) Split out new connection request sockets so that they can be
    established in the main hash table.  Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath.  From Eric Dumazet.

13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk.

14) Support stable privacy address generation for RFC7217 in IPV6.  From
    Hannes Frederic Sowa.

15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
  fm10k: Bump driver version to 0.15.2
  fm10k: corrected VF multicast update
  fm10k: mbx_update_max_size does not drop all oversized messages
  fm10k: reset head instead of calling update_max_size
  fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
  fm10k: update xcast mode before synchronizing multicast addresses
  fm10k: start service timer on probe
  fm10k: fix function header comment
  fm10k: comment next_vf_mbx flow
  fm10k: don't handle mailbox events in iov_event path and always process mailbox
  fm10k: use separate workqueue for fm10k driver
  fm10k: Set PF queues to unlimited bandwidth during virtualization
  fm10k: expose tx_timeout_count as an ethtool stat
  fm10k: only increment tx_timeout_count in Tx hang path
  fm10k: remove extraneous "Reset interface" message
  fm10k: separate PF only stats so that VF does not display them
  fm10k: use hw->mac.max_queues for stats
  fm10k: only show actual queues, not the maximum in hardware
  fm10k: allow creation of VLAN on default vid
  fm10k: fix unused warnings
  ...
2015-04-15 09:00:47 -07:00
Rusty Russell b9cc4489c6 params: handle quotes properly for values not of form foo="bar".
When starting kernel with arguments like:
  init=/bin/sh -c "echo arguments"
the trailing double quote is not removed which results in following command
being executed:
  /bin/sh -c 'echo arguments"'

Reported-by: Arthur Gautier <baloo@gandi.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-04-15 13:31:23 +09:30
Linus Torvalds 2481bc7528 Power management and ACPI updates for v4.1-rc1
- Generic PM domains support update including new PM domain
    callbacks to handle device initialization better (Russell King,
    Rafael J Wysocki, Kevin Hilman).
 
  - Unified device properties API update including a new mechanism
    for accessing data provided by platform initialization code
    (Rafael J Wysocki, Adrian Hunter).
 
  - ARM cpuidle update including ARM32/ARM64 handling consolidation
    (Daniel Lezcano).
 
  - intel_idle update including support for the Silvermont Core in
    the Baytrail SOC and for the Airmont Core in the Cherrytrail and
    Braswell SOCs (Len Brown, Mathias Krause).
 
  - New cpufreq driver for Hisilicon ACPU (Leo Yan).
 
  - intel_pstate update including support for the Knights Landing
    chip (Dasaratharaman Chandramouli, Kristen Carlson Accardi).
 
  - QorIQ cpufreq driver update (Tang Yuantian, Arnd Bergmann).
 
  - powernv cpufreq driver update (Shilpasri G Bhat).
 
  - devfreq update including Tegra support changes (Tomeu Vizoso,
    MyungJoo Ham, Chanwoo Choi).
 
  - powercap RAPL (Running-Average Power Limit) driver update
    including support for Intel Broadwell server chips (Jacob Pan,
    Mathias Krause).
 
  - ACPI device enumeration update related to the handling of the
    special PRP0001 device ID allowing DT-style 'compatible' property
    to be used for ACPI device identification (Rafael J Wysocki).
 
  - ACPI EC driver update including limited _DEP support (Lan Tianyu,
    Lv Zheng).
 
  - ACPI backlight driver update including a new mechanism to allow
    native backlight handling to be forced on non-Windows 8 systems
    and a new quirk for Lenovo Ideapad Z570 (Aaron Lu, Hans de Goede).
 
  - New Windows Vista compatibility quirk for Sony VGN-SR19XN (Chen Yu).
 
  - Assorted ACPI fixes and cleanups (Aaron Lu, Martin Kepplinger,
    Masanari Iida, Mika Westerberg, Nan Li, Rafael J Wysocki).
 
  - Fixes related to suspend-to-idle for the iTCO watchdog driver and
    the ACPI core system suspend/resume code (Rafael J Wysocki, Chen Yu).
 
  - PM tracing support for the suspend phase of system suspend/resume
    transitions (Zhonghui Fu).
 
  - Configurable delay for the system suspend/resume testing facility
    (Brian Norris).
 
  - PNP subsystem cleanups (Peter Huewe, Rafael J Wysocki).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJVLbO+AAoJEILEb/54YlRx5N4QAJXsmEW1FL2l6mMAyTQkEsVj
 nbqjF9I6aJgYM9+i8GKaZJxpN17SAZ7Ii7aCAXjPwX8AvjT70+gcZr+KDWtPir61
 B75VNVEcUYOR4vOF5Z6rQcQMlhGPkfMOJYXFMahpOG6DdPbVh1x2/tuawfc6IC0V
 a6S/fln6WqHrXQ+8swDSv1KuZsav6+8AQaTlNUQkkuXdY9b3k/3xiy5C2K26APP8
 x1B39iAF810qX6ipnK0gEOC3Vs29dl7hvNmgOVmmkBGVS7+pqTuy5n1/9M12cDRz
 78IQ7DXB0NcSwr5tdrmGVUyH0Q6H9lnD3vO7MJkYwKDh5a/2MiBr2GZc4KHDKDWn
 E1sS27f1Pdn9qnpWLzTcY+yYNV3EEyre56L2fc+sh+Xq9sNOjUah+Y/eAej/IxYD
 XYRf+GAj768yCJgNP+Y3PJES/PRh+0IZ/dn5k0Qq2iYvc8mcObyG6zdQIvCucv/i
 70uV1Z2GWEb31cI9TUV8o5GrMW3D0KI9EsCEEpiFFUnhjNog3AWcerGgFQMHxu7X
 ZnNSzudvek+XJ3NtpbPgTiJAmnMz8bDvBQm3G1LUO2TQdjYTU6YMUHsfzXs8DL6c
 aIMWO4stkVuDtWrlT/hfzIXepliccyXmSP6sbH+zNNCepulXe5C4M2SftaDi4l/B
 uIctXWznvHoGys+EFL+v
 =erd3
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI updates from Rafael Wysocki:
 "These are mostly fixes and cleanups all over, although there are a few
  items that sort of fall into the new feature category.

  First off, we have new callbacks for PM domains that should help us to
  handle some issues related to device initialization in a better way.

  There also is some consolidation in the unified device properties API
  area allowing us to use that inferface for accessing data coming from
  platform initialization code in addition to firmware-provided data.

  We have some new device/CPU IDs in a few drivers, support for new
  chips and a new cpufreq driver too.

  Specifics:

   - Generic PM domains support update including new PM domain callbacks
     to handle device initialization better (Russell King, Rafael J
     Wysocki, Kevin Hilman)

   - Unified device properties API update including a new mechanism for
     accessing data provided by platform initialization code (Rafael J
     Wysocki, Adrian Hunter)

   - ARM cpuidle update including ARM32/ARM64 handling consolidation
     (Daniel Lezcano)

   - intel_idle update including support for the Silvermont Core in the
     Baytrail SOC and for the Airmont Core in the Cherrytrail and
     Braswell SOCs (Len Brown, Mathias Krause)

   - New cpufreq driver for Hisilicon ACPU (Leo Yan)

   - intel_pstate update including support for the Knights Landing chip
     (Dasaratharaman Chandramouli, Kristen Carlson Accardi)

   - QorIQ cpufreq driver update (Tang Yuantian, Arnd Bergmann)

   - powernv cpufreq driver update (Shilpasri G Bhat)

   - devfreq update including Tegra support changes (Tomeu Vizoso,
     MyungJoo Ham, Chanwoo Choi)

   - powercap RAPL (Running-Average Power Limit) driver update including
     support for Intel Broadwell server chips (Jacob Pan, Mathias Krause)

   - ACPI device enumeration update related to the handling of the
     special PRP0001 device ID allowing DT-style 'compatible' property
     to be used for ACPI device identification (Rafael J Wysocki)

   - ACPI EC driver update including limited _DEP support (Lan Tianyu,
     Lv Zheng)

   - ACPI backlight driver update including a new mechanism to allow
     native backlight handling to be forced on non-Windows 8 systems and
     a new quirk for Lenovo Ideapad Z570 (Aaron Lu, Hans de Goede)

   - New Windows Vista compatibility quirk for Sony VGN-SR19XN (Chen Yu)

   - Assorted ACPI fixes and cleanups (Aaron Lu, Martin Kepplinger,
     Masanari Iida, Mika Westerberg, Nan Li, Rafael J Wysocki)

   - Fixes related to suspend-to-idle for the iTCO watchdog driver and
     the ACPI core system suspend/resume code (Rafael J Wysocki, Chen Yu)

   - PM tracing support for the suspend phase of system suspend/resume
     transitions (Zhonghui Fu)

   - Configurable delay for the system suspend/resume testing facility
     (Brian Norris)

   - PNP subsystem cleanups (Peter Huewe, Rafael J Wysocki)"

* tag 'pm+acpi-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits)
  ACPI / scan: Fix NULL pointer dereference in acpi_companion_match()
  ACPI / scan: Rework modalias creation when "compatible" is present
  intel_idle: mark cpu id array as __initconst
  powercap / RAPL: mark rapl_ids array as __initconst
  powercap / RAPL: add ID for Broadwell server
  intel_pstate: Knights Landing support
  intel_pstate: remove MSR test
  cpufreq: fix qoriq uniprocessor build
  ACPI / scan: Take the PRP0001 position in the list of IDs into account
  ACPI / scan: Simplify acpi_match_device()
  ACPI / scan: Generalize of_compatible matching
  device property: Introduce firmware node type for platform data
  device property: Make it possible to use secondary firmware nodes
  PM / watchdog: iTCO: stop watchdog during system suspend
  cpufreq: hisilicon: add acpu driver
  ACPI / EC: Call acpi_walk_dep_device_list() after installing EC opregion handler
  cpufreq: powernv: Report cpu frequency throttling
  intel_idle: Add support for the Airmont Core in the Cherrytrail and Braswell SOCs
  intel_idle: Update support for Silvermont Core in Baytrail SOC
  PM / devfreq: tegra: Register governor on module init
  ...
2015-04-14 20:21:54 -07:00
Paul E. McKenney 8d7dc9283f rcu: Control grace-period delays directly from value
In a misguided attempt to avoid an #ifdef, the use of the
gp_init_delay module parameter was conditioned on the corresponding
RCU_TORTURE_TEST_SLOW_INIT Kconfig variable, using IS_ENABLED() at
the point of use in the code.  This meant that the compiler always saw
the delay, which meant that RCU_TORTURE_TEST_SLOW_INIT_DELAY had to be
unconditionally defined.  This in turn caused "make oldconfig" to ask
pointless questions about the value of RCU_TORTURE_TEST_SLOW_INIT_DELAY
in cases where it was not even used.

This commit avoids these pointless questions by defining gp_init_delay
under #ifdef.  In one branch, gp_init_delay is initialized to
RCU_TORTURE_TEST_SLOW_INIT_DELAY and is also a module parameter (thus
allowing boot-time modification), and in the other branch gp_init_delay
is a const variable initialized by default to zero.

This approach also simplifies the code at the delay point by eliminating
the IS_DEFINED().  Because gp_init_delay is constant zero in the no-delay
case intended for production use, the "gp_init_delay > 0" check causes
the delay to become dead code, as desired in this case.  In addition,
this commit replaces magic constant "10" with the preprocessor variable
PER_RCU_NODE_PERIOD, which controls the number of grace periods that
are allowed to elapse at full speed before a delay is inserted.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-04-14 19:33:59 -07:00
Linus Torvalds 1dcf58d6e6 Merge branch 'akpm' (patches from Andrew)
Merge first patchbomb from Andrew Morton:

 - arch/sh updates

 - ocfs2 updates

 - kernel/watchdog feature

 - about half of mm/

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
  Documentation: update arch list in the 'memtest' entry
  Kconfig: memtest: update number of test patterns up to 17
  arm: add support for memtest
  arm64: add support for memtest
  memtest: use phys_addr_t for physical addresses
  mm: move memtest under mm
  mm, hugetlb: abort __get_user_pages if current has been oom killed
  mm, mempool: do not allow atomic resizing
  memcg: print cgroup information when system panics due to panic_on_oom
  mm: numa: remove migrate_ratelimited
  mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
  mm: split ET_DYN ASLR from mmap ASLR
  s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
  mm: expose arch_mmap_rnd when available
  s390: standardize mmap_rnd() usage
  powerpc: standardize mmap_rnd() usage
  mips: extract logic for mmap_rnd()
  arm64: standardize mmap_rnd() usage
  x86: standardize mmap_rnd() usage
  arm: factor out mmap ASLR into mmap_rnd
  ...
2015-04-14 16:49:17 -07:00
David Rientjes 6e276d2a51 kernel, cpuset: remove exception for __GFP_THISNODE
Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
remove the obscure comment about it and its special-case exception.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:03 -07:00
Ulrich Obergfell 692297d8f9 watchdog: introduce the hardlockup_detector_disable() function
Have kvm_guest_init() use hardlockup_detector_disable() instead of
watchdog_enable_hardlockup_detector(false).

Remove the watchdog_hardlockup_detector_is_enabled() and the
watchdog_enable_hardlockup_detector() function which are no longer needed.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell b2f57c3a0d watchdog: clean up some function names and arguments
Rename the update_timers*() functions to update_watchdog*().

Remove the boolean argument from watchdog_enable_all_cpus() because
update_watchdog_all_cpus() is now a generic function to change the run
state of the lockup detectors and to have the lockup detectors use a new
sample period.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell 195daf665a watchdog: enable the new user interface of the watchdog mechanism
With the current user interface of the watchdog mechanism it is only
possible to disable or enable both lockup detectors at the same time.
This series introduces new kernel parameters and changes the semantics of
some existing kernel parameters, so that the hard lockup detector and the
soft lockup detector can be disabled or enabled individually.  With this
series applied, the user interface is as follows.

- parameters in /proc/sys/kernel

  . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

  . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

  . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

  . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

- kernel command line parameters

  . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

  . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

  . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.

Also, remove the proc_dowatchdog() function which is no longer needed.

[dzickus@redhat.com: wrote changelog]
[dzickus@redhat.com: update documentation for kernel params and sysctl]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell bcfba4f4bf watchdog: implement error handling for failure to set up hardware perf events
If watchdog_nmi_enable() fails to set up the hardware perf event of one
CPU, the entire hard lockup detector is deemed unreliable.  Hence, disable
the hard lockup detector and shut down the hardware perf events on all
CPUs.

[dzickus@redhat.com: update comments to explain some code]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell 83a80a3907 watchdog: introduce separate handlers for parameters in /proc/sys/kernel
Separate handlers for each watchdog parameter in /proc/sys/kernel replace
the proc_dowatchdog() function.  Three of those handlers merely call
proc_watchdog_common() with one different argument.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell ef246a216b watchdog: introduce proc_watchdog_common()
Three of four handlers for the watchdog parameters in /proc/sys/kernel
essentially have to do the same thing.

  if the parameter is being read {
    return the state of the corresponding bit(s) in 'watchdog_enabled'
  } else {
    set/clear the state of the corresponding bit(s) in 'watchdog_enabled'
    update the run state of the lockup detector(s)
  }

Hence, introduce a common function that can be called by those handlers.
The callers pass a 'bit mask' to this function to indicate which bit(s)
should be set/cleared in 'watchdog_enabled'.

This function handles an uncommon race with watchdog_nmi_enable() where a
concurrent update of 'watchdog_enabled' is possible.  We use 'cmpxchg' to
detect the concurrency.  [This avoids introducing a new spinlock or a
mutex to synchronize updates of 'watchdog_enabled'.  Using the same lock
or mutex in watchdog thread context and in system call context needs to be
considered carefully because it can make the code prone to deadlock
situations in connection with parking/unparking the watchdog threads.]

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell f54c2274f5 watchdog: move definition of 'watchdog_proc_mutex' outside of proc_dowatchdog()
This series removes proc_dowatchdog().  Since multiple new functions need
the 'watchdog_proc_mutex' to serialize access to the watchdog parameters
in /proc/sys/kernel, move the mutex outside of any function.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Ulrich Obergfell a0c9cbb93d watchdog: introduce the proc_watchdog_update() function
This series introduces a separate handler for each watchdog parameter in
/proc/sys/kernel.  The separate handlers need a common function that they
can call to update the run state of the lockup detectors, or to have the
lockup detectors use a new sample period.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Ulrich Obergfell 84d56e66b9 watchdog: new definitions and variables, initialization
The hardlockup and softockup had always been tied together.  Due to the
request of KVM folks, they had a need to have one enabled but not the
other.  Internally rework the code to split things apart more cleanly.

There is a bunch of churn here, but the end result should be code that
should be easier to maintain and fix without knowing the internals of what
is going on.

This patch (of 9):

Introduce new definitions and variables to separate the user interface in
/proc/sys/kernel from the internal run state of the lockup detectors.  The
internal run state is represented by two bits in a new variable that is
named 'watchdog_enabled'.  This helps simplify the code, for example:

- In order to check if any of the two lockup detectors is enabled,
  it is sufficient to check if 'watchdog_enabled' is not zero.

- In order to enable/disable one or both lockup detectors,
  it is sufficient to set/clear one or both bits in 'watchdog_enabled'.

- Concurrent updates of 'watchdog_enabled' need not be synchronized via
  a spinlock or a mutex. Updates can either be atomic or concurrency can
  be detected by using 'cmpxchg'.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Linus Torvalds ca2ec32658 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:
 "Part one:

   - struct filename-related cleanups

   - saner iov_iter_init() replacements (and switching the syscalls to
     use of those)

   - ntfs switch to ->write_iter() (Anton)

   - aio cleanups and splitting iocb into common and async parts
     (Christoph)

   - assorted fixes (me, bfields, Andrew Elble)

  There's a lot more, including the completion of switchover to
  ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
  race fixes, etc, but that goes after #for-davem merge.  David has
  pulled it, and once it's in I'll send the next vfs pull request"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
  sg_start_req(): use import_iovec()
  sg_start_req(): make sure that there's not too many elements in iovec
  blk_rq_map_user(): use import_single_range()
  sg_io(): use import_iovec()
  process_vm_access: switch to {compat_,}import_iovec()
  switch keyctl_instantiate_key_common() to iov_iter
  switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
  aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
  vmsplice_to_user(): switch to import_iovec()
  kill aio_setup_single_vector()
  aio: simplify arguments of aio_setup_..._rw()
  aio: lift iov_iter_init() into aio_setup_..._rw()
  lift iov_iter into {compat_,}do_readv_writev()
  NFS: fix BUG() crash in notify_change() with patch to chown_common()
  dcache: return -ESTALE not -EBUSY on distributed fs race
  NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
  VFS: Add iov_iter_fault_in_multipages_readable()
  drop bogus check in file_open_root()
  switch security_inode_getattr() to struct path *
  constify tomoyo_realpath_from_path()
  ...
2015-04-14 15:31:03 -07:00
Linus Torvalds 6c8a53c9e6 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "Core kernel changes:

   - One of the more interesting features in this cycle is the ability
     to attach eBPF programs (user-defined, sandboxed bytecode executed
     by the kernel) to kprobes.

     This allows user-defined instrumentation on a live kernel image
     that can never crash, hang or interfere with the kernel negatively.
     (Right now it's limited to root-only, but in the future we might
     allow unprivileged use as well.)

     (Alexei Starovoitov)

   - Another non-trivial feature is per event clockid support: this
     allows, amongst other things, the selection of different clock
     sources for event timestamps traced via perf.

     This feature is sought by people who'd like to merge perf generated
     events with external events that were measured with different
     clocks:

       - cluster wide profiling

       - for system wide tracing with user-space events,

       - JIT profiling events

     etc.  Matching perf tooling support is added as well, available via
     the -k, --clockid <clockid> parameter to perf record et al.

     (Peter Zijlstra)

  Hardware enablement kernel changes:

   - x86 Intel Processor Trace (PT) support: which is a hardware tracer
     on steroids, available on Broadwell CPUs.

     The hardware trace stream is directly output into the user-space
     ring-buffer, using the 'AUX' data format extension that was added
     to the perf core to support hardware constraints such as the
     necessity to have the tracing buffer physically contiguous.

     This patch-set was developed for two years and this is the result.
     A simple way to make use of this is to use BTS tracing, the PT
     driver emulates BTS output - available via the 'intel_bts' PMU.
     More explicit PT specific tooling support is in the works as well -
     will probably be ready by 4.2.

     (Alexander Shishkin, Peter Zijlstra)

   - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
     feature of Intel Xeon CPUs that allows the measurement and
     allocation/partitioning of caches to individual workloads.

     These kernel changes expose the measurement side as a new PMU
     driver, which exposes various QoS related PMU events.  (The
     partitioning change is work in progress and is planned to be merged
     as a cgroup extension.)

     (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
     Waskiewicz Jr)

   - x86 Intel Haswell LBR call stack support: this is a new Haswell
     feature that allows the hardware recording of call chains, plus
     tooling support.  To activate this feature you have to enable it
     via the new 'lbr' call-graph recording option:

        perf record --call-graph lbr
        perf report

     or:

        perf top --call-graph lbr

     This hardware feature is a lot faster than stack walk or dwarf
     based unwinding, but has some limitations:

       - It reuses the current LBR facility, so LBR call stack and
         branch record can not be enabled at the same time.

       - It is only available for user-space callchains.

     (Yan, Zheng)

   - x86 Intel Broadwell CPU support and various event constraints and
     event table fixes for earlier models.

     (Andi Kleen)

   - x86 Intel HT CPUs event scheduling workarounds.  This is a complex
     CPU bug affecting the SNB,IVB,HSW families that results in counter
     value corruption.  The mitigation code is automatically enabled and
     is transparent.

     (Maria Dimakopoulou, Stephane Eranian)

  The perf tooling side had a ton of changes in this cycle as well, so
  I'm only able to list the user visible changes here, in addition to
  the tooling changes outlined above:

  User visible changes affecting all tools:

      - Improve support of compressed kernel modules (Jiri Olsa)
      - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
      - Bash completion for subcommands (Yunlong Song)
      - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
      - Support missing -f to override perf.data file ownership. (Yunlong Song)
      - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)

  User visible changes in individual tools:

    'perf data':

        New tool for converting perf.data to other formats, initially
        for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
        Sebastian Siewior)

    'perf diff':

        Add --kallsyms option (David Ahern)

    'perf list':

        Allow listing events with 'tracepoint' prefix (Yunlong Song)

        Sort the output of the command (Yunlong Song)

    'perf kmem':

        Respect -i option (Jiri Olsa)

        Print big numbers using thousands' group (Namhyung Kim)

        Allow -v option (Namhyung Kim)

        Fix alignment of slab result table (Namhyung Kim)

    'perf probe':

        Support multiple probes on different binaries on the same command line (Masami Hiramatsu)

        Support unnamed union/structure members data collection. (Masami Hiramatsu)

        Check kprobes blacklist when adding new events. (Masami Hiramatsu)

    'perf record':

        Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)

        Support recording running/enabled time (Andi Kleen)

    'perf sched':

        Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)

    'perf report' and 'perf top':

        Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)

        Indicate which callchain entries are annotated in the
        TUI hists browser (Arnaldo Carvalho de Melo)

        Add pid/tid filtering to 'report' and 'script' commands (David Ahern)

        Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
        cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
        events (Arnaldo Carvalho de Melo)

    'perf stat':

        Report unsupported events properly (Suzuki K. Poulose)

        Output running time and run/enabled ratio in CSV mode (Andi Kleen)

    'perf trace':

        Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)

        Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)

        Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)

        Dump stack on segfaults (Arnaldo Carvalho de Melo)

        No need to explicitely enable evsels for workload started from perf, let it
        be enabled via perf_event_attr.enable_on_exec, removing some events that take
        place in the 'perf trace' before a workload is really started by it.
        (Arnaldo Carvalho de Melo)

        Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)

  There's also been a ton of infrastructure work done, such as the
  split-out of perf's build system into tools/build/ and other changes -
  see the shortlog and changelog for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
  perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
  perf evlist: Fix type for references to data_head/tail
  perf probe: Check the orphaned -x option
  perf probe: Support multiple probes on different binaries
  perf buildid-list: Fix segfault when show DSOs with hits
  perf tools: Fix cross-endian analysis
  perf tools: Fix error path to do closedir() when synthesizing threads
  perf tools: Fix synthesizing fork_event.ppid for non-main thread
  perf tools: Add 'I' event modifier for exclude_idle bit
  perf report: Don't call map__kmap if map is NULL.
  perf tests: Fix attr tests
  perf probe: Fix ARM 32 building error
  perf tools: Merge all perf_event_attr print functions
  perf record: Add clockid parameter
  perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
  perf sched replay: Support using -f to override perf.data file ownership
  perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
  perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
  perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
  perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
  ...
2015-04-14 14:37:47 -07:00
Linus Torvalds e95e7f6270 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ changes from Ingo Molnar:
 "This tree adds full dynticks support to KVM guests (support the
  disabling of the timer tick on the guest).  The main missing piece was
  the recognition of guest execution as RCU extended quiescent state and
  related changes"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest
  context_tracking: Export context_tracking_user_enter/exit
  context_tracking: Run vtime_user_enter/exit only when state == CONTEXT_USER
  context_tracking: Add stub context_tracking_is_enabled
  context_tracking: Generalize context tracking APIs to support user and guest
  context_tracking: Rename context symbols to prepare for transition state
  ppc: Remove unused cpp symbols in kvm headers
2015-04-14 13:58:48 -07:00
Linus Torvalds 078838d565 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "The main changes in this cycle were:

   - changes permitting use of call_rcu() and friends very early in
     boot, for example, before rcu_init() is invoked.

   - add in-kernel API to enable and disable expediting of normal RCU
     grace periods.

   - improve RCU's handling of (hotplug-) outgoing CPUs.

   - NO_HZ_FULL_SYSIDLE fixes.

   - tiny-RCU updates to make it more tiny.

   - documentation updates.

   - miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
  cpu: Defer smpboot kthread unparking until CPU known to scheduler
  rcu: Associate quiescent-state reports with grace period
  rcu: Yet another fix for preemption and CPU hotplug
  rcu: Add diagnostics to grace-period cleanup
  rcutorture: Default to grace-period-initialization delays
  rcu: Handle outgoing CPUs on exit from idle loop
  cpu: Make CPU-offline idle-loop transition point more precise
  rcu: Eliminate ->onoff_mutex from rcu_node structure
  rcu: Process offlining and onlining only at grace-period start
  rcu: Move rcu_report_unblock_qs_rnp() to common code
  rcu: Rework preemptible expedited bitmask handling
  rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
  rcutorture: Enable slow grace-period initializations
  rcu: Provide diagnostic option to slow down grace-period initialization
  rcu: Detect stalls caused by failure to propagate up rcu_node tree
  rcu: Eliminate empty HOTPLUG_CPU ifdef
  rcu: Simplify sync_rcu_preempt_exp_init()
  rcu: Put all orphan-callback-related code under same comment
  rcu: Consolidate offline-CPU callback initialization
  ...
2015-04-14 13:36:04 -07:00
Linus Torvalds eeee78cf77 Some clean ups and small fixes, but the biggest change is the addition
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
 
 Tracepoints have helper functions for the TP_printk() called
 __print_symbolic() and __print_flags() that lets a numeric number be
 displayed as a a human comprehensible text. What is placed in the
 TP_printk() is also shown in the tracepoint format file such that
 user space tools like perf and trace-cmd can parse the binary data
 and express the values too. Unfortunately, the way the TRACE_EVENT()
 macro works, anything placed in the TP_printk() will be shown pretty
 much exactly as is. The problem arises when enums are used. That's
 because unlike macros, enums will not be changed into their values
 by the C pre-processor. Thus, the enum string is exported to the
 format file, and this makes it useless for user space tools.
 
 The TRACE_DEFINE_ENUM() solves this by converting the enum strings
 in the TP_printk() format into their number, and that is what is
 shown to user space. For example, the tracepoint tlb_flush currently
 has this in its format file:
 
      __print_symbolic(REC->reason,
         { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
         { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
         { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
         { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
 
 After adding:
 
      TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
      TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
      TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
      TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
 
 Its format file will contain this:
 
      __print_symbolic(REC->reason,
         { 0, "flush on task switch" },
         { 1, "remote shootdown" },
         { 2, "local shootdown" },
         { 3, "local mm shootdown" })
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVLBTuAAoJEEjnJuOKh9ldjHMIALdRS755TXCZGOf0r7O2akOR
 wMPeum7C+ae1mH+jCsJKUC0/jUfQKaMt/UxoHlipDgcGg8kD2jtGnGCw4Xlwvdsr
 y4rFmcTRSl1mo0zDSsg6ujoupHlVYN0+JPjrd7S3cv/llJoY49zcanNLF7S2XLeM
 dZCtWRLWYpBiWO68ai6AqJTnE/eGFIqBI048qb5Eg8dbK243SSeSIf9Ywhb+VsA+
 aq6F7cWI/H6j4tbeza8tAN19dcwenDro5EfCDY8ARQHJu1f6Y3+DLf2imjkd6Aiu
 JVAoGIjHIpI+djwCZC1u4gi4urjfOqYartrM3Q54tb3YWYqHeNqP2ASI2a4EpYk=
 =Ixwt
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "Some clean ups and small fixes, but the biggest change is the addition
  of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.

  Tracepoints have helper functions for the TP_printk() called
  __print_symbolic() and __print_flags() that lets a numeric number be
  displayed as a a human comprehensible text.  What is placed in the
  TP_printk() is also shown in the tracepoint format file such that user
  space tools like perf and trace-cmd can parse the binary data and
  express the values too.  Unfortunately, the way the TRACE_EVENT()
  macro works, anything placed in the TP_printk() will be shown pretty
  much exactly as is.  The problem arises when enums are used.  That's
  because unlike macros, enums will not be changed into their values by
  the C pre-processor.  Thus, the enum string is exported to the format
  file, and this makes it useless for user space tools.

  The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
  the TP_printk() format into their number, and that is what is shown to
  user space.  For example, the tracepoint tlb_flush currently has this
  in its format file:

     __print_symbolic(REC->reason,
        { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
        { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
        { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
        { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })

  After adding:

     TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
     TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
     TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
     TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);

  Its format file will contain this:

     __print_symbolic(REC->reason,
        { 0, "flush on task switch" },
        { 1, "remote shootdown" },
        { 2, "local shootdown" },
        { 3, "local mm shootdown" })"

* tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
  tracing: Add enum_map file to show enums that have been mapped
  writeback: Export enums used by tracepoint to user space
  v4l: Export enums used by tracepoints to user space
  SUNRPC: Export enums in tracepoints to user space
  mm: tracing: Export enums in tracepoints to user space
  irq/tracing: Export enums in tracepoints to user space
  f2fs: Export the enums in the tracepoints to userspace
  net/9p/tracing: Export enums in tracepoints to userspace
  x86/tlb/trace: Export enums in used by tlb_flush tracepoint
  tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
  tracing: Allow for modules to convert their enums to values
  tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
  tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
  tracing: Give system name a pointer
  brcmsmac: Move each system tracepoints to their own header
  iwlwifi: Move each system tracepoints to their own header
  mac80211: Move message tracepoints to their own header
  tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
  tracing: Add TRACE_SYSTEM_VAR to kvm-s390
  tracing: Add TRACE_SYSTEM_VAR to intel-sst
  ...
2015-04-14 10:49:03 -07:00
Linus Torvalds 3f3c73de77 This adds the new tracefs file system. This has been in linux-next for
more than one release, as I had it ready for the 4.0 merge window, but
 a last minute thing that needed to go into Linux first had to be done.
 That was that perf hard coded the file system number when reading
 /sys/kernel/debugfs/tracing directory making sure that the path had
 the debugfs mount # before it would parse the tracing file. This broke
 other use cases of perf, and the check is removed.
 
 Now when mounting /sys/kernel/debug, tracefs is automatically mounted
 in /sys/kernel/debug/tracing such that old tools will still see that
 path as expected. But now system admins can mount tracefs directly
 and not need to mount debugfs, which can expose security issues.
 A new directory is created when tracefs is configured such that
 system admins can now mount it separately (/sys/kernel/tracing).
 
 This branch is based off of Al Viro's vfs debugfs_automount branch
 at commit 163f9eb95a
 debugfs: Provide a file creation function that also takes an initial size
 to get the debugfs_create_automount() operation.
 I just noticed that Al rebased the pull to add his Signed-off-by to
 that commit, and the commit is now e59b4e9187.
 I did a git diff of those two and see they are the same. Only the
 latter has Al's SOB.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVLA6YAAoJEEjnJuOKh9ldv6AH/1JUINDQwV+M0VTwzbLogloo
 Sco0byLhskmx5KLVD7Vs8BJAGrgHTdit32kzBGmLGJvVCKBa+c8lwmRw6rnXg3uX
 K4kGp7BIyn1/geXoGpCmDKaLGXhDcw49hRzejKDg/OqFtxKTsSeQtG8fo29ps9Do
 0VaF6UDp8gYplC2N2BfpB59LVndrITQ3mSsBBeFPvS7IxFJXAhDBOq2yi0aI6HyJ
 ICo2L/bA9HLxMuceWrXbsun+RP68+AQlnFfAtok7AcuBzUYPCKY0shT2VMOUtpTt
 1dGxMxq6q1ACfmY7gbp47WMX9aKjWcSEr0V+IYx/xex6Maf0Xsujsy99bKYUWvs=
 =OcgU
 -----END PGP SIGNATURE-----

Merge tag 'trace-4.1-tracefs' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracefs from Steven Rostedt:
 "This adds the new tracefs file system.

  This has been in linux-next for more than one release, as I had it
  ready for the 4.0 merge window, but a last minute thing that needed to
  go into Linux first had to be done.  That was that perf hard coded the
  file system number when reading /sys/kernel/debugfs/tracing directory
  making sure that the path had the debugfs mount # before it would
  parse the tracing file.  This broke other use cases of perf, and the
  check is removed.

  Now when mounting /sys/kernel/debug, tracefs is automatically mounted
  in /sys/kernel/debug/tracing such that old tools will still see that
  path as expected.  But now system admins can mount tracefs directly
  and not need to mount debugfs, which can expose security issues.  A
  new directory is created when tracefs is configured such that system
  admins can now mount it separately (/sys/kernel/tracing)"

* tag 'trace-4.1-tracefs' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Have mkdir and rmdir be part of tracefs
  tracefs: Add directory /sys/kernel/tracing
  tracing: Automatically mount tracefs on debugfs/tracing
  tracing: Convert the tracing facility over to use tracefs
  tracefs: Add new tracefs file system
  tracing: Create cmdline tracer options on tracing fs init
  tracing: Only create tracer options files if directory exists
  debugfs: Provide a file creation function that also takes an initial size
2015-04-14 10:22:29 -07:00