Commit Graph

19516 Commits

Author SHA1 Message Date
Thomas Gleixner 74faaf7aa6 genirq: Move irq_chip_write_msi_msg() helper to core
No point to expose this to the world. The only legitimate user is the
core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
2014-12-07 21:49:45 +01:00
Greg Kroah-Hartman dd63af108f Merge 3.18-rc7 into tty-next
This resolves the merge issue with drivers/tty/serial/of_serial.c

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06 08:17:24 -08:00
Alexei Starovoitov ddd872bc30 bpf: verifier: add checks for BPF_ABS | BPF_IND instructions
introduce program type BPF_PROG_TYPE_SOCKET_FILTER that is used
for attaching programs to sockets where ctx == skb.

add verifier checks for ABS/IND instructions which can only be seen
in socket filters, therefore the check:
  if (env->prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER)
    verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-05 21:47:32 -08:00
Eric W. Biederman 0542f17bf2 userns: Document what the invariant required for safe unprivileged mappings.
The rule is simple.  Don't allow anything that wouldn't be allowed
without unprivileged mappings.

It was previously overlooked that establishing gid mappings would
allow dropping groups and potentially gaining permission to files and
directories that had lesser permissions for a specific group than for
all other users.

This is the rule needed to fix CVE-2014-8989 and prevent any other
security issues with new_idmap_permitted.

The reason for this rule is that the unix permission model is old and
there are programs out there somewhere that take advantage of every
little corner of it.  So allowing a uid or gid mapping to be
established without privielge that would allow anything that would not
be allowed without that mapping will result in expectations from some
code somewhere being violated.  Violated expectations about the
behavior of the OS is a long way to say a security issue.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-05 19:07:26 -06:00
Eric W. Biederman 7ff4d90b4c groups: Consolidate the setgroups permission checks
Today there are 3 instances of setgroups and due to an oversight their
permission checking has diverged.  Add a common function so that
they may all share the same permission checking code.

This corrects the current oversight in the current permission checks
and adds a helper to avoid this in the future.

A user namespace security fix will update this new helper, shortly.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-05 17:19:27 -06:00
Daniel Vetter 7bd0e226e3 drm/i915: compute wait_ioctl timeout correctly
We've lost the +1 required for correct timeouts in

commit 5ed0bdf21a
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Wed Jul 16 21:05:06 2014 +0000

    drm: i915: Use nsec based interfaces

    Use ktime_get_raw_ns() and get rid of the back and forth timespec
    conversions.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: John Stultz <john.stultz@linaro.org>

So fix this up by reinstating our handrolled _timeout function. While
at it bother with handling MAX_JIFFIES.

v2: Convert to usecs (we don't care about the accuracy anyway) first
to avoid overflow issues Dave Gordon spotted.

v3: Drop the explicit MAX_JIFFY_OFFSET check, usecs_to_jiffies should
take care of that already. It might be a bit too enthusiastic about it
though.

v4: Chris has a much nicer color, so use his implementation.

This requires to export nsec_to_jiffies from time.c.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Gordon <david.s.gordon@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82749
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2014-12-05 15:20:24 +02:00
Al Viro f77c80142e bury struct proc_ns in fs/proc
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.

That way struct proc_ns becomes invisible outside of fs/proc/*.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:54 -05:00
Al Viro 33c429405a copy address of proc_ns_ops into ns_common
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:47 -05:00
Al Viro 6344c433a4 new helpers: ns_alloc_inum/ns_free_inum
take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:36 -05:00
Al Viro 64964528b2 make proc_ns_operations work with struct ns_common * instead of void *
We can do that now.  And kill ->inum(), while we are at it - all instances
are identical.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:17 -05:00
Al Viro 3c04118461 switch the rest of proc_ns_operations to working with &...->ns
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:11 -05:00
Al Viro 435d5f4bb2 common object embedded into various struct ....ns
for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:31:00 -05:00
Tejun Heo 0479c8c549 workqueue: cosmetic update in rescuer_thread()
rescuer_thread() caches &rescuer->scheduled in a local variable
scheduled for convenience.  There's one WARN_ON_ONCE() which was using
&rescuer->scheduled directly.  Replace it with the local variable.

This patch causes no functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-12-04 10:14:54 -05:00
Andy Lutomirski 7cc78f8fa0 context_tracking: Restore previous state in schedule_user
It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context.  This causes RCU
warnings and possible failures.

This is intended to be a minimal fix suitable for 3.18.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03 20:55:58 -08:00
Rafael J. Wysocki d30d819dc8 PM: Drop CONFIG_PM_RUNTIME from the driver core
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so quite a few
depend on CONFIG_PM or even may be dropped entirely in some cases.

Replace CONFIG_PM_RUNTIME with CONFIG_PM in the PM core code.

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-12-04 00:46:58 +01:00
Dan Carpenter 3558a5ac50 tracing: Truncated output is better than nothing
The initial reason for this patch is that I noticed that:

	if (len > TRACE_BUF_SIZE)

is off by one.  In this code, if len == TRACE_BUF_SIZE, then it means we
have truncated the last character off the output string.  If we truncate
two or more characters then we exit without printing.

After some discussion, we decided that printing truncated data is better
than not printing at all so we should just use vscnprintf() and remove
the test entirely.  Also I have updated memcpy() to copy the NUL char
instead of setting the NUL in a separate step.

Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwanda

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-03 17:10:14 -05:00
Byungchul Park 8e1e1df29d tracing: Add additional marks to signal very large time deltas
Currently, function graph tracer prints "!" or "+" just before
function execution time to signal a function overhead, depending
on the time. And some tracers tracing latency also print "!" or
"+" just after time to signal overhead, depending on the interval
between events. Even it is usually enough to do that, we sometimes
need to signal for bigger execution time than 100 micro seconds.

For example, I used function graph tracer to detect if there is
any case that exit_mm() takes too much time. I did following steps
in /sys/kernel/debug/tracing. It was easier to detect very large
excution time with patched kernel than with original kernel.

$ echo exit_mm > set_graph_function
$ echo function_graph > current_tracer
$ echo > trace
$ cat trace_pipe > $LOGFILE
 ... (do something and terminate logging)
$ grep "\\$" $LOGFILE
 3) $ 22082032 us |                      } /* kernel_map_pages */
 3) $ 22082040 us |                    } /* free_pages_prepare */
 3) $ 22082113 us |                  } /* free_hot_cold_page */
 3) $ 22083455 us |                } /* free_hot_cold_page_list */
 3) $ 22083895 us |              } /* release_pages */
 3) $ 22177873 us |            } /* free_pages_and_swap_cache */
 3) $ 22178929 us |          } /* unmap_single_vma */
 3) $ 22198885 us |        } /* unmap_vmas */
 3) $ 22206949 us |      } /* exit_mmap */
 3) $ 22207659 us |    } /* mmput */
 3) $ 22207793 us |  } /* exit_mm */

And then, it was easy to find out that a schedule-out occured by
sub_preempt_count() within kernel_map_pages().

To detect very large function exection time caused by either problematic
function implementation or scheduling issues, this patch can be useful.

Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.com

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-03 17:10:13 -05:00
Steven Rostedt (Red Hat) eabb8980a9 tracing: Allow NOT to filter AND and OR clauses
Add support to allow not "!" for and (&&) and (||). That is:

 !(field1 == X && field2 == Y)

Where the value of the full clause will be notted.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-03 10:00:27 -05:00
Steven Rostedt (Red Hat) e12c09cf30 tracing: Add NOT to filtering logic
Ted noticed that he could not filter on an event for a bit being cleared.
That's because the filtering logic only tests event fields with a limited
number of comparisons which, for bit logic, only include "&", which can
test if a bit is set, but there's no good way to see if a bit is clear.

This adds a way to do: !(field & 2048)

Which returns true if the bit is not set, and false otherwise.

Note, currently !(field1 == 10 && field2 == 15) is not supported.
That is, the 'not' only works for direct comparisons, not for the
AND and OR logic.

Link: http://lkml.kernel.org/r/20141202021912.GA29096@thunk.org
Link: http://lkml.kernel.org/r/20141202120430.71979060@gandalf.local.home

Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-03 10:00:13 -05:00
Dave Airlie e8115e79aa Linux 3.18-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUe7l9AAoJEHm+PkMAQRiGkGcIAIryQ7NKn4IaxUtS807Lx4Ih
 obEnx7nNKZTXCZpD/7XQGHMMJyozMJR50PHZESJoHu4Luhv9h7EFRnyJ6MdqMlwn
 zla3zY0yRsHwPoJKcHbSE0CPHZz0WPQHj7IEbM+XJz2tMNJfbgTrezElmcCM4DRp
 c9ae+ggwZ2cyNYM0r2RSwSJ525WMh69f9dzSUE27fpvkllQgwqNs/jHYz8HNOEht
 FWcv5UhvzKjwJS3awULfOB3zH2QdFvVTrwAzd+kbV2Q6T6CaUoFRlhXeKUO6W2Jv
 pJM6oj8tMZUkdXEv7EQXT1kwEqC4DULTTTHs4tSF79O1ESmNfePiOwwBcwoM2nM=
 =kG1Y
 -----END PGP SIGNATURE-----

Merge tag 'v3.18-rc7' into drm-next

This fixes a bunch of conflicts prior to merging i915 tree.

Linux 3.18-rc7

Conflicts:
	drivers/gpu/drm/exynos/exynos_drm_drv.c
	drivers/gpu/drm/i915/i915_drv.c
	drivers/gpu/drm/i915/intel_pm.c
	drivers/gpu/drm/tegra/dc.c
2014-12-02 10:58:33 +10:00
David S. Miller 60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Rafael J. Wysocki af261127e9 Merge back earlier 'pm-runtime' material for 3.19-rc1. 2014-11-27 01:37:53 +01:00
John Stultz cb2aa63469 time: Fix sign bug in NTP mult overflow warning
In commit 6067dc5a8c ("time: Avoid possible NTP adjustment
mult overflow") a new check was added to watch for adjustments
that could cause a mult overflow.

Unfortunately the check compares a signed with unsigned value
and ignored the case where the adjustment was negative, which
causes spurious warn-ons on some systems (and seems like it
would result in problematic time adjustments there as well, due
to the early return).

Thus this patch adds a check to make sure the adjustment is
positive before we check for an overflow, and resovles the issue
in my testing.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Debugged-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1416890145-30048-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-25 07:18:34 +01:00
Andy Lutomirski 82975bc6a6 uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns.  I suspect that this is a mistake and that
the code only works because int3 is paranoid.

Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug.  With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-23 14:25:28 -08:00
Thomas Gleixner 90e362f4a7 sched: Provide update_curr callbacks for stop/idle scheduling classes
Chris bisected a NULL pointer deference in task_sched_runtime() to
commit 6e998916df 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency'.

Chris observed crashes in atop or other /proc walking programs when he
started fork bombs on his machine.  He assumed that this is a new exit
race, but that does not make any sense when looking at that commit.

What's interesting is that, the commit provides update_curr callbacks
for all scheduling classes except stop_task and idle_task.

While nothing can ever hit that via the clock_nanosleep() and
clock_gettime() interfaces, which have been the target of the commit in
question, the author obviously forgot that there are other code paths
which invoke task_sched_runtime()

do_task_stat(()
 thread_group_cputime_adjusted()
   thread_group_cputime()
     task_cputime()
       task_sched_runtime()
        if (task_current(rq, p) && task_on_rq_queued(p)) {
          update_rq_clock(rq);
          up->sched_class->update_curr(rq);
        }

If the stats are read for a stomp machine task, aka 'migration/N' and
that task is current on its cpu, this will happily call the NULL pointer
of stop_task->update_curr.  Ooops.

Chris observation that this happens faster when he runs the fork bomb
makes sense as the fork bomb will kick migration threads more often so
the probability to hit the issue will increase.

Add the missing update_curr callbacks to the scheduler classes stop_task
and idle_task.  While idle tasks cannot be monitored via /proc we have
other means to hit the idle case.

Fixes: 6e998916df 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
Reported-by: Chris Mason <clm@fb.com>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-23 14:14:40 -08:00
Jiang Liu 38b6a1cf3e PCI/MSI: Move cached entry functions to irq core
Required to support non PCI based MSI.

[ tglx: Extracted from Jiangs patch series ]

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:47 +01:00
Jiang Liu aeeb59657c genirq: Provide default callbacks for msi_domain_ops
Extend struct msi_domain_info and provide default callbacks for
msi_domain_ops.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/1416061447-9472-8-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:47 +01:00
Jiang Liu d9109698be genirq: Introduce msi_domain_alloc/free_irqs()
Introduce msi_domain_{alloc|free}_irqs() to alloc/free interrupts
from generic MSI irqdomain.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/1416061447-9472-7-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:47 +01:00
Jiang Liu f3cf8bb0d6 genirq: Add generic msi irq domain support
Implement the basic functions for MSI interrupt support with
hierarchical interrupt domains.

[ tglx: Extracted and combined from several patches ]

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:47 +01:00
Marc Zyngier f86eff222f genirq: Work around __irq_set_handler vs stacked domains ordering issues
With the introduction of stacked domains, we have the issue that,
depending on where in the stack this is called, __irq_set_handler
will succeed or fail: If this is called from the inner irqchip,
__irq_set_handler() will fail, as it will look at the outer domain
as the (desc->irq_data.chip == &no_irq_chip) test fails (we haven't
set the top level yet).

This patch implements the following: "If there is at least one
valid irqchip in the domain, it will probably sort itself out".
This is clearly not ideal, but it is far less confusing then
crashing because the top-level domain is not up yet.

[ tglx: Added comment and a protection against chained interrupts in
  	that context ]

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1416048553-29289-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:47 +01:00
Jiang Liu afb7da83b9 irqdomain: Introduce helper function irq_domain_add_hierarchy()
Introduce helper function irq_domain_add_hierarchy(), which creates
a linear irqdomain if parameter 'size' is not zero, otherwise creates
a tree irqdomain.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/1416061447-9472-5-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Jiang Liu 36d727310c irqdomain: Implement a method to automatically call parent domains alloc/free
Add a flags to irq_domain.flags to control whether the irqdomain core
should automatically call parent irqdomain's alloc/free callbacks. It
help to reduce hierarchy irqdomains users' code size.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/1416061447-9472-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Jiang Liu 1b5377087c genirq: Introduce helper irq_domain_set_info() to reduce duplicated code
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Jiang Liu 2cb625478f genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchip
Add IRQ_SET_MASK_OK_DONE in addition to IRQ_SET_MASK_OK and
IRQ_SET_MASK_OK_NOCOPY to support stacked irqchip. IRQ_SET_MASK_OK_DONE
is the same as IRQ_SET_MASK_OK to irq core. To stacked irqchip, it means
that ascendant irqchips have done all the work and no more handling
needed in descendant irqchips.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Jiang Liu 515085ef7e genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchip
Add callback irq_compose_msi_msg to struct irq_chip, which will be used
to support stacked irqchip.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Yingjoe Chen 56e8abab61 genirq: Add more helper functions to support stacked irq_chip
Add more helper function for stacked irq_chip to just call parent's
function.

Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Gran Likely <grant.likely@linaro.org>
Cc: Boris BREZILLON <boris.brezillon@free-electrons.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: <srv_heupstream@mediatek.com>
Cc: <yingjoe.chen@gmail.com>
Cc: <hc.yen@mediatek.com>
Cc: <eddie.huang@mediatek.com>
Cc: <nathan.chung@mediatek.com>
Cc: <yh.chen@mediatek.com>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1415893029-2971-3-git-send-email-yingjoe.chen@mediatek.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Jiang Liu 85f08c17de genirq: Introduce helper functions to support stacked irq_chip
Now we already support hierarchy irq_data, so introduce several helpers
to support stacked irq_chips.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Yingjoe Chen 0cc01abab6 irqdomain: Do irq_find_mapping and set_type for hierarchy irqdomain in case OF
It is possible to call irq_create_of_mapping to create/translate the
same IRQ from DT for multiple times. Perform irq_find_mapping check
and set_type for hierarchy irqdomain in irq_create_of_mapping() to
avoid duplicate these functionality in all outer most irqdomain.

Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:46 +01:00
Jiang Liu f8264e3496 irqdomain: Introduce new interfaces to support hierarchy irqdomains
We plan to use hierarchy irqdomain to suppport CPU vector assignment,
interrupt remapping controller, IO-APIC controller, MSI interrupt
and hypertransport interrupt etc on x86 platforms. So extend irqdomain
interfaces to support hierarchy irqdomain.

There are already many clients of current irqdomain interfaces.
To minimize the changes, we choose to introduce new version 2 interfaces
to support hierarchy instead of extending existing irqdomain interfaces.

According to Thomas's suggestion, the most important design decision is
to build hierarchy struct irq_data to support hierarchy irqdomain, so
hierarchy irqdomain related data could be saved in struct irq_data.
With support of hierarchy irq_data, we could also support stacked
irq_chips. This is most useful in case of set_affinity().

The new hierarchy irqdomain introduces following interfaces:
1) irq_domain_alloc_irqs()/irq_domain_free_irqs(): allocate/release IRQ
   and related resources.
2) __irq_domain_alloc_irqs(): a special version to support legacy IRQs.
3) irq_domain_activate_irq()/irq_domain_deactivate_irq(): program
   interrupt controllers to activate/deactivate interrupt.

There are also several help functions to ease irqdomain implemenations:
1) irq_domain_get_irq_data(): get irq_data associated with a specific
   irqdomain.
2) irq_domain_set_hwirq_and_chip(): save irqdomain specific data into
   irq_data.
3) irq_domain_alloc_irqs_parent()/irq_domain_free_irqs_parent(): invoke
   parent irqdomain's alloc/free callbacks.

We also changed irq_startup()/irq_shutdown() to invoke
irq_domain_activate_irq()/irq_domain_deactivate_irq() to program
interrupt controller when start/stop interrupts.

[ tglx: Folded parts of the later patch series in ]

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:45 +01:00
Tejun Heo cceb9bd633 Merge branch 'master' into for-3.19
Pull in to receive 54ef6df3f3 ("rcu: Provide counterpart to
rcu_dereference() for non-RCU situations").

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-11-22 09:32:08 -05:00
David S. Miller 1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Linus Torvalds 8b2ed21e84 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes: two NUMA fixes, two cputime fixes and an RCU/lockdep fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
  sched/cputime: Fix cpu_timer_sample_group() double accounting
  sched/numa: Avoid selecting oneself as swap target
  sched/numa: Fix out of bounds read in sched_init_numa()
  sched: Remove lockdep check in sched_move_task()
2014-11-21 15:44:54 -08:00
Linus Torvalds 13f5004c94 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: two Intel uncore driver fixes, a CPU-hotplug fix and a
  build dependencies fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix boot crash on SBOX PMU on Haswell-EP
  perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
  perf: Fix corruption of sibling list with hotplug
  perf/x86: Fix embarrasing typo
2014-11-21 15:44:07 -08:00
John Stultz 5322e4c264 time: Fixup comments to reflect usage of timespec64
Fix up a few comments that weren't updated when the
functions were converted to use timespec64 structures.

Acked-by: Arnd Bergmann <arnd.bergmann@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:59 -08:00
John Stultz 334334b5f5 time: Expose get_monotonic_coarse64() for in-kernel uses
Adds a timespec64 based get_monotonic_coarse64() implementation
that can be used as we convert internal users of
get_monotonic_coarse away from using timespecs.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:59 -08:00
John Stultz cdba2ec538 time: Expose getrawmonotonic64 for in-kernel uses
Adds a timespec64 based getrawmonotonic64() implementation
that can be used as we convert internal users of
getrawmonotonic away from using timespecs.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:58 -08:00
pang.xunlei 90b6ce9c40 time: Provide y2038 safe mktime() replacement
As part of addressing "y2038 problem" for in-kernel uses, this
patch adds safe mktime64() using time64_t.

After this patch, mktime() is deprecated and all its call sites
will be fixed using mktime64(), after that it can be removed.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:58 -08:00
pang.xunlei 04d9089086 time: Provide y2038 safe timekeeping_inject_sleeptime() replacement
As part of addressing "y2038 problem" for in-kernel uses, this
patch adds timekeeping_inject_sleeptime64() using timespec64.

After this patch, timekeeping_inject_sleeptime() is deprecated
and all its call sites will be fixed using the new interface,
after that it can be removed.

NOTE: timekeeping_inject_sleeptime() is safe actually, but we
want to eliminate timespec eventually, so comes this patch.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:57 -08:00
pang.xunlei 21f7eca555 time: Provide y2038 safe do_settimeofday() replacement
The kernel uses 32-bit signed value(time_t) for seconds elapsed
1970-01-01:00:00:00, thus it will overflow at 2038-01-19 03:14:08
on 32-bit systems. This is widely known as the y2038 problem.

As part of addressing "y2038 problem" for in-kernel uses, this patch
adds safe do_settimeofday64() using timespec64.

After this patch, do_settimeofday() is deprecated and all its call
sites will be fixed using do_settimeofday64(), after that it can be
removed.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:57 -08:00
pang.xunlei 659bc17b80 time: Complete NTP adjustment threshold judging conditions
The clocksource mult-adjustment threshold is [mult-maxadj, mult+maxadj],
timekeeping_adjust() only deals with the upper threshold, but misses the
lower threshold.

This patch adds the lower threshold judging condition.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Minor fix for > 80 char line]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:56 -08:00
pang.xunlei 6067dc5a8c time: Avoid possible NTP adjustment mult overflow.
Ideally, __clocksource_updatefreq_scale, selects the largest shift
value possible for a clocksource. This results in the mult memember of
struct clocksource being particularly large, although not so large
that NTP would adjust the clock to cause it to overflow.

That said, nothing actually prohibits an overflow from occuring, its
just that it "shouldn't" occur.

So while very unlikely, and so far never observed, the value of
(cs->mult+cs->maxadj) may have a chance to reach very near 0xFFFFFFFF,
so there is a possibility it may overflow when doing NTP positive
adjustment

See the following detail: When NTP slewes the clock, kernel goes
through update_wall_time()->...->timekeeping_apply_adjustment():
	tk->tkr.mult += mult_adj;

Since there is no guard against it, its possible tk->tkr.mult may
overflow during this operation.

This patch avoids any possible mult overflow by judging the overflow
case before adding mult_adj to mult, also adds the WARNING message
when capturing such case.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Reworded commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:56 -08:00
John Stultz fd866e2b11 time: Rename udelay_test.c to test_udelay.c
Kees requested that this test module be renamed for consistency sake,
so this patch renames the udelay_test.c file (recently added to
tip/timers/core for 3.17) to test_udelay.c

Cc: Kees Cook <keescook@chromium.org>
Cc: Greg KH <greg@kroah.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linux-Next <linux-next@vger.kernel.org>
Cc: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:55 -08:00
Masami Hiramatsu 1d70be34df kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
Add FTRACE_OPS_FL_IPMODIFY flag to kprobe_ftrace_ops
since kprobes can changes regs->ip.

Link: http://lkml.kernel.org/r/20141121102523.11844.21298.stgit@localhost.localdomain

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-21 14:44:15 -05:00
Masami Hiramatsu f8b8be8a31 ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
ftrace users who may modify regs->ip to change the execution
path. If two or more users modify the regs->ip on the same
function entry, one of them will be broken. So they must add
IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.

Note that ftrace doesn't allow ftrace_ops which has IPMODIFY
flag to have notrace hash, and the ftrace_ops must have a
filter hash (so that the ftrace_ops can hook only specific
entries), because it strongly depends on the address and
must be allowed for only few selected functions.

Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain

Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ fixed up some of the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-21 14:42:10 -05:00
Steven Rostedt (Red Hat) 04b74b27c2 printk/percpu: Define printk_func when printk is not defined
To avoid include hell, the per_cpu variable printk_func was declared
in percpu.h. But it is only defined if printk is defined.

As users of printk may also use the printk_func variable, it needs to
be defined even if CONFIG_PRINTK is not.

Also add a printk.h include in percpu.h just to be safe.

Link: http://lkml.kernel.org/r/20141121183215.01ba539c@canb.auug.org.au

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-21 11:19:15 -05:00
Steven Rostedt (Red Hat) 0af26492d5 tracing/trivial: Fix typos and make an int into a bool
Fix up a few typos in comments and convert an int into a bool in
update_traceon_count().

Link: http://lkml.kernel.org/r/546DD445.5080108@hitachi.com

Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-20 10:05:36 -05:00
Jiri Kosina a02001086b Merge Linus' tree to be be to apply submitted patches to newer code than
current trivial.git base
2014-11-20 14:42:02 +01:00
Frans Klaver eff264efee kernel: trace: fix printk message
s,produciton,production

Signed-off-by: Frans Klaver <frans.klaver@xsens.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-11-20 14:29:19 +01:00
Ingo Molnar d360b78f99 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Streamline RCU's use of per-CPU variables, shifting from "cpu"
   arguments to functions to "this_"-style per-CPU variable accessors.

 - Signal-handling RCU updates.

 - Real-time updates.

 - Torture-test updates.

 - Miscellaneous fixes.

 - Documentation updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-20 08:57:58 +01:00
Steven Rostedt (Red Hat) afdc34a3d3 printk: Add per_cpu printk func to allow printk to be diverted
Being able to divert printk to call another function besides the normal
logging is useful for such things like NMI handling. If some functions
are to be called from NMI that does printk() it is possible to lock up
the box if the nmi handler triggers when another printk is happening.

One example of this use is to perform a stack trace on all CPUs via NMI.
But if the NMI is to do the printk() it can cause the system to lock up.
By allowing the printk to be diverted to another function that can safely
record the printk output and then print it when it in a safe context
then NMIs will be safe to call these functions like show_regs().

Link: http://lkml.kernel.org/p/20140619213952.209176403@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:21 -05:00
Steven Rostedt (Red Hat) 8d58e99af5 seq_buf: Move the seq_buf code to lib/
The seq_buf functions are rather useful outside of tracing. Instead
of having it be dependent on CONFIG_TRACING, move the code into lib/
and allow other users to have access to it even when tracing is not
configured.

The seq_buf utility is similar to the seq_file utility, but instead of
writing sending data back up to userland, it writes it into a buffer
defined at seq_buf_init(). This allows us to send a descriptor around
that writes printf() formatted strings into it that can be retrieved
later.

It is currently used by the tracing facility for such things like trace
events to convert its binary saved data in the ring buffer into an
ASCII human readable context to be displayed in /sys/kernel/debug/trace.

It can also be used for doing NMI prints safely from NMI context into
the seq_buf and retrieved later and dumped to printk() safely. Doing
printk() from an NMI context is dangerous because an NMI can preempt
a current printk() and deadlock on it.

Link: http://lkml.kernel.org/p/20140619213952.058255809@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:20 -05:00
Steven Rostedt (Red Hat) 2448913ed2 seq-buf: Make seq_buf_bprintf() conditional on CONFIG_BINARY_PRINTF
The function bstr_printf() from lib/vsprnintf.c is only available if
CONFIG_BINARY_PRINTF is defined. This is due to the only user currently
being the tracing infrastructure, which needs to select this config
when tracing is configured. Until there is another user of the binary
printf formats, this will continue to be the case.

Since seq_buf.c is now lives in lib/ and is compiled even without
tracing, it must encompass its use of bstr_printf() which is used
by seq_buf_printf(). This too is only used by the tracing infrastructure
and is still encapsulated by the CONFIG_BINARY_PRINTF.

Link: http://lkml.kernel.org/r/20141104160222.969013383@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:19 -05:00
Steven Rostedt (Red Hat) 01cb06a4c2 tracing: Add seq_buf_get_buf() and seq_buf_commit() helper functions
Add two helper functions; seq_buf_get_buf() and seq_buf_commit() that
are used by seq_buf_path(). This makes the code similar to the
seq_file: seq_path() function, and will help to be able to consolidate
the functions between seq_file and trace_seq.

Link: http://lkml.kernel.org/r/20141104160222.644881406@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.977571447@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:18 -05:00
Steven Rostedt (Red Hat) 8cd709ae76 tracing: Have seq_buf use full buffer
Currently seq_buf is full when all but one byte of the buffer is
filled. Change it so that the seq_buf is full when all of the
buffer is filled.

Some of the functions would fill the buffer completely and report
everything was fine. This was inconsistent with the max of size - 1.
Changing this to be max of size makes all functions consistent.

Link: http://lkml.kernel.org/r/20141104160222.502133196@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.811957882@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:17 -05:00
Steven Rostedt (Red Hat) 9b77215382 seq_buf: Add seq_buf_can_fit() helper function
Add a seq_buf_can_fit() helper function that removes the possible mistakes
of comparing the seq_buf length plus added data compared to the size of
the buffer.

Link: http://lkml.kernel.org/r/20141118164025.GL23958@pathway.suse.cz

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:17 -05:00
Steven Rostedt (Red Hat) 820b75f63d tracing: Add paranoid size check in trace_printk_seq()
To be really paranoid about writing out of bound data in
trace_printk_seq(), add another check of len compared to size.

Link: http://lkml.kernel.org/r/20141119144004.GB2332@dhcp128.suse.cz

Suggested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:16 -05:00
Steven Rostedt (Red Hat) 5ac4837841 tracing: Use trace_seq_used() and seq_buf_used() instead of len
As the seq_buf->len will soon be +1 size when there's an overflow, we
must use trace_seq_used() or seq_buf_used() methods to get the real
length. This will prevent buffer overflow issues if just the len
of the seq_buf descriptor is used to copy memory.

Link: http://lkml.kernel.org/r/20141114121911.09ba3d38@gandalf.local.home

Reported-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:15 -05:00
Steven Rostedt (Red Hat) 74f06bb723 tracing: Clean up tracing_fill_pipe_page()
The function tracing_fill_pipe_page() logic is a little confusing with the
use of count saving the seq.len and reusing it.

Instead of subtracting a number that is calculated from the saved
value of the seq.len from seq.len, just save the seq.len at the start
and if we need to reset it, just assign it again.

When the seq_buf overflow is len == size + 1, the current logic will
break. Changing it to use a saved length for resetting back to the
original value is more robust and will work when we change the way
seq_buf sets the overflow.

Link: http://lkml.kernel.org/r/20141118161546.GJ23958@pathway.suse.cz

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:14 -05:00
Steven Rostedt (Red Hat) eeab98154d seq_buf: Create seq_buf_used() to find out how much was written
Add a helper function seq_buf_used() that replaces the SEQ_BUF_USED()
private macro to let callers have a method to know how much of the
seq_buf was written to.

Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org
Link: http://lkml.kernel.org/r/20141114011413.321654244@goodmis.org

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:13 -05:00
Steven Rostedt (Red Hat) dd23180aac tracing: Convert seq_buf_path() to be like seq_path()
Rewrite seq_buf_path() like it is done in seq_path() and allow
it to accept any escape character instead of just "\n".

Making seq_buf_path() like seq_path() will help prevent problems
when converting seq_file to use the seq_buf logic.

Link: http://lkml.kernel.org/r/20141104160222.048795666@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.338523371@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:10 -05:00
Steven Rostedt (Red Hat) 3a161d99c4 tracing: Create seq_buf layer in trace_seq
Create a seq_buf layer that trace_seq sits on. The seq_buf will not
be limited to page size. This will allow other usages of seq_buf
instead of a hard set PAGE_SIZE one that trace_seq has.

Link: http://lkml.kernel.org/r/20141104160221.864997179@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 22:01:09 -05:00
Markus Elfring 16a8ef2751 tracing: Deletion of an unnecessary check before iput()
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/5468F875.7080907@users.sourceforge.net

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 16:28:45 -05:00
Alexei Starovoitov daaf427c6a bpf: fix arraymap NULL deref and missing overflow and zero size checks
- fix NULL pointer dereference:
kernel/bpf/arraymap.c:41 array_map_alloc() error: potential null dereference 'array'.  (kzalloc returns null)
kernel/bpf/arraymap.c:41 array_map_alloc() error: we previously assumed 'array' could be null (see line 40)

- integer overflow check was missing in arraymap
(hashmap checks for overflow via kmalloc_array())

- arraymap can round_up(value_size, 8) to zero. check was missing.

- hashmap was missing zero size check as well, since roundup_pow_of_two() can
truncate into zero

- found a typo in the arraymap comment and unnecessary empty line

Fix all of these issues and make both overflow checks explicit U32 in size.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-19 15:40:00 -05:00
Steven Rostedt (Red Hat) 8e2e095cbe tracing: Fix return value of ftrace_raw_output_prep()
If the trace_seq of ftrace_raw_output_prep() is full this function
returns TRACE_TYPE_PARTIAL_LINE, otherwise it returns zero.

The problem is that TRACE_TYPE_PARTIAL_LINE happens to be zero!

The thing is, the caller of ftrace_raw_output_prep() expects a
success to be zero. Change that to expect it to be
TRACE_TYPE_HANDLED.

Link: http://lkml.kernel.org/r/20141114112522.GA2988@dhcp128.suse.cz

Reminded-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:48 -05:00
Steven Rostedt (Red Hat) dba39448ab tracing: Remove return values of most trace_seq_*() functions
The trace_seq_printf() and friends are used to store strings into a buffer
that can be passed around from function to function. If the trace_seq buffer
fills up, it will not print any more. The return values were somewhat
inconsistant and using trace_seq_has_overflowed() was a better way to know
if the write to the trace_seq buffer succeeded or not.

Now that all users have removed reading the return value of the printf()
type functions, they can safely return void and keep future users of them
from reading the inconsistent values as well.

Link: http://lkml.kernel.org/r/20141114011411.992510720@goodmis.org

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:47 -05:00
Steven Rostedt (Red Hat) 183742f08c tracing: Do not use return values of trace_seq_printf() in syscall tracing
The functions trace_seq_printf() and friends will not be returning values
soon and will be void functions. To know if they succeeded or not, the
functions trace_seq_has_overflowed() and trace_handle_return() should be
used instead.

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:46 -05:00
Steven Rostedt (Red Hat) 8579a107a6 tracing/uprobes: Do not use return values of trace_seq_printf()
The functions trace_seq_printf() and friends will soon no longer have
return values. Using trace_seq_has_overflowed() and trace_handle_return()
should be used instead.

Link: http://lkml.kernel.org/r/20141114011411.693008134@goodmis.org
Link: http://lkml.kernel.org/r/20141115050602.333705855@goodmis.org

Reviewed-by: Masami Hiramatsu <masami.hiramatu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:45 -05:00
Steven Rostedt (Red Hat) d2b0191a38 tracing/probes: Do not use return value of trace_seq_printf()
The functions trace_seq_printf() and friends will soon not have a return
value and will only be a void function. Use trace_seq_has_overflowed()
instead to know if the trace_seq operations succeeded or not.

Link: http://lkml.kernel.org/r/20141114011411.530216306@goodmis.org

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:44 -05:00
Steven Rostedt (Red Hat) a72e10afab tracing: Do not check return values of trace_seq_p*() for mmio tracer
The return values for trace_seq_printf() and friends are going to be
removed and they will become void functions. The mmio tracer checked
their return and even did so incorrectly.

Some of the funtions which returned the values were never checked
themselves. Removing all the checks simplifies the code.

Use trace_seq_has_overflowed() and trace_handle_return() where
necessary instead.

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:44 -05:00
Steven Rostedt (Red Hat) 85224da0b8 kprobes/tracing: Use trace_seq_has_overflowed() for overflow checks
Instead of checking the return value of trace_seq_printf() and friends
for overflowing of the buffer, use the trace_seq_has_overflowed() helper
function.

This cleans up the code quite a bit and also takes us a step closer to
changing the return values of trace_seq_printf() and friends to void.

Link: http://lkml.kernel.org/r/20141114011411.181812785@goodmis.org

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:43 -05:00
Steven Rostedt (Red Hat) 9d9add34ec tracing: Have function_graph use trace_seq_has_overflowed()
Instead of doing individual checks all over the place that makes the code
very messy. Just check trace_seq_has_overflowed() at the end or in
strategic places.

This makes the code much cleaner and also helps with getting closer
to removing the return values of trace_seq_printf() and friends.

Link: http://lkml.kernel.org/r/20141114011410.987913836@goodmis.org

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:42 -05:00
Steven Rostedt (Red Hat) 7d40f67165 tracing: Have branch tracer use trace_handle_return() helper function
The branch tracer should not be checking the trace_seq_printf() return value
as that will soon be void. There's a new trace_handle_return() helper function
that will return TRACE_TYPE_PARTIAL_LINE if the trace_seq overflowed
and TRACE_TYPE_HANDLED otherwise.

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:41 -05:00
Steven Rostedt (Red Hat) c0cd93aa16 ring-buffer: Remove check of trace_seq_{puts,printf}() return values
Remove checking the return value of all trace_seq_puts(). It was wrong
anyway as only the last return value mattered. But as the trace_seq_puts()
is going to be a void function in the future, we should not be checking
the return value of it anyway.

Just return !trace_seq_has_overflowed() instead.

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:40 -05:00
Steven Rostedt (Red Hat) f4a1d08ce6 blktrace/tracing: Use trace_seq_has_overflowed() helper function
Checking the return code of every trace_seq_printf() operation and having
to return early if it overflowed makes the code messy.

Using the new trace_seq_has_overflowed() and trace_handle_return() functions
allows us to clean up the code.

In the future, trace_seq_printf() and friends will be turning into void
functions and not returning a value. The trace_seq_has_overflowed() is to
be used instead. This cleanup allows that change to take place.

Cc: Jens Axboe <axboe@fb.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:39 -05:00
Steven Rostedt (Red Hat) 19a7fe2062 tracing: Add trace_seq_has_overflowed() and trace_handle_return()
Adding a trace_seq_has_overflowed() which returns true if the trace_seq
had too much written into it allows us to simplify the code.

Instead of checking the return value of every call to trace_seq_printf()
and friends, they can all be called normally, and at the end we can
return !trace_seq_has_overflowed() instead.

Several functions also return TRACE_TYPE_PARTIAL_LINE when the trace_seq
overflowed and TRACE_TYPE_HANDLED otherwise. Another helper function
was created called trace_handle_return() which takes a trace_seq and
returns these enums. Using this helper function also simplifies the
code.

This change also makes it possible to remove the return values of
trace_seq_printf() and friends. They should instead just be
void functions.

Link: http://lkml.kernel.org/r/20141114011410.365183157@goodmis.org

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:39 -05:00
Steven Rostedt (Red Hat) e400a40cff tracing: Fix trace_seq_bitmask() to start at current position
In trace_seq_bitmask() it calls bitmap_scnprintf() not from the current
position of the trace_seq buffer (s->buffer + s->len), but instead from
the beginning of the buffer (s->buffer).

Luckily, the only user of this "ipi_raise tracepoint" uses it as the
first parameter, and as such, the start of the temp buffer in
include/trace/ftrace.h (see __get_bitmask()).

Reported-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:38 -05:00
Steven Rostedt (Red Hat) aec0be2d6e ftrace/x86/extable: Add is_ftrace_trampoline() function
Stack traces that happen from function tracing check if the address
on the stack is a __kernel_text_address(). That is, is the address
kernel code. This calls core_kernel_text() which returns true
if the address is part of the builtin kernel code. It also calls
is_module_text_address() which returns true if the address belongs
to module code.

But what is missing is ftrace dynamically allocated trampolines.
These trampolines are allocated for individual ftrace_ops that
call the ftrace_ops callback functions directly. But if they do a
stack trace, the code checking the stack wont detect them as they
are neither core kernel code nor module address space.

Adding another field to ftrace_ops that also stores the size of
the trampoline assigned to it we can create a new function called
is_ftrace_trampoline() that returns true if the address is a
dynamically allocate ftrace trampoline. Note, it ignores trampolines
that are not dynamically allocated as they will return true with
the core_kernel_text() function.

Link: http://lkml.kernel.org/r/20141119034829.497125839@goodmis.org

Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-19 15:25:26 -05:00
Al Viro 9f45f5bf30 new helper: audit_file()
... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.

[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:26 -05:00
Al Viro b583043e99 kill f_dentry uses
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:25 -05:00
Alexey Ishchuk 4eafad7feb s390/kernel: add system calls for PCI memory access
Add the new __NR_s390_pci_mmio_write and __NR_s390_pci_mmio_read
system calls to allow user space applications to access device PCI I/O
memory pages on s390x platform.

[ Martin Schwidefsky: some code beautification ]

Signed-off-by: Alexey Ishchuk <aishchuk@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-11-19 09:46:43 +01:00
Steven Rostedt (Red Hat) a9ce7c36aa tracing: Fix race of function probes counting
The function probe counting for traceon and traceoff suffered a race
condition where if the probe was executing on two or more CPUs at the
same time, it could decrement the counter by more than one when
disabling (or enabling) the tracer only once.

The way the traceon and traceoff probes are suppose to work is that
they disable (or enable) tracing once per count. If a user were to
echo 'schedule:traceoff:3' into set_ftrace_filter, then when the
schedule function was called, it would disable tracing. But the count
should only be decremented once (to 2). Then if the user enabled tracing
again (via tracing_on file), the next call to schedule would disable
tracing again and the count would be decremented to 1.

But if multiple CPUS called schedule at the same time, it is possible
that the count would be decremented more than once because of the
simple "count--" used.

By reading the count into a local variable and using memory barriers
we can guarantee that the count would only be decremented once per
disable (or enable).

The stack trace probe had a similar race, but here the stack trace will
decrement for each time it is called. But this had the read-modify-
write race, where it could stack trace more than the number of times
that was specified. This case we use a cmpxchg to stack trace only the
number of times specified.

The dump probes can still use the old "update_count()" function as
they only run once, and that is controlled by the dump logic
itself.

Link: http://lkml.kernel.org/r/20141118134643.4b550ee4@gandalf.local.home

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-18 23:06:35 -05:00
Rafael J. Wysocki b2b49ccbdd PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected
The number of and dependencies between high-level power management
Kconfig options make life much harder than necessary.  Several
conbinations of them have to be tested and supported, even though
some of those combinations are very rarely used in practice (if
they are used in practice at all).  Moreover, the fact that we
have separate independent Kconfig options for runtime PM and
system suspend is a serious obstacle for integration between
the two frameworks.

To overcome these difficulties, always select PM_RUNTIME if PM_SLEEP
is set.  Among other things, this will allow system suspend callbacks
provided by bus types and device drivers to rely on the runtime PM
framework regardless of the kernel configuration.

Enthusiastically-acked-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-18 23:41:40 +01:00
Alexei Starovoitov 7943c0f329 bpf: remove test map scaffolding and user proper types
proper types and function helpers are ready. Use them in verifier testsuite.
Remove temporary stubs

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:44:00 -05:00
Alexei Starovoitov d0003ec01c bpf: allow eBPF programs to use maps
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:44:00 -05:00
Alexei Starovoitov a1854d6ac0 bpf: fix BPF_MAP_LOOKUP_ELEM command return code
fix errno of BPF_MAP_LOOKUP_ELEM command as bpf manpage
described it in commit b4fc1a460f30("Merge branch 'bpf-next'"):
-----
BPF_MAP_LOOKUP_ELEM
    int bpf_lookup_elem(int fd, void *key, void *value)
    {
        union bpf_attr attr = {
            .map_fd = fd,
            .key = ptr_to_u64(key),
            .value = ptr_to_u64(value),
        };

        return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }
    bpf() syscall looks up an element with given key in  a  map  fd.
    If  element  is found it returns zero and stores element's value
    into value.  If element is not found  it  returns  -1  and  sets
    errno to ENOENT.

and further down in manpage:

   ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM,  indicates  that
          element with given key was not found.
-----

In general all BPF commands return ENOENT when map element is not found
(including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with
 flags == BPF_MAP_UPDATE_ONLY)

Subsequent patch adds a testsuite to check return values for all of
these combinations.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:43:59 -05:00
Alexei Starovoitov 28fbcfa08d bpf: add array type of eBPF maps
add new map type BPF_MAP_TYPE_ARRAY and its implementation

- optimized for fastest possible lookup()
  . in the future verifier/JIT may recognize lookup() with constant key
    and optimize it into constant pointer. Can optimize non-constant
    key into direct pointer arithmetic as well, since pointers and
    value_size are constant for the life of the eBPF program.
    In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT
    while preserving concurrent access to this map from user space

- two main use cases for array type:
  . 'global' eBPF variables: array of 1 element with key=0 and value is a
    collection of 'global' variables which programs can use to keep the state
    between events
  . aggregation of tracing events into fixed set of buckets

- all array elements pre-allocated and zero initialized at init time

- key as an index in array and can only be 4 byte

- map_delete_elem() returns EINVAL, since elements cannot be deleted

- map_update_elem() replaces elements in an non-atomic way
  (for atomic updates hashtable type should be used instead)

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:43:59 -05:00
Alexei Starovoitov 0f8e4bd8a1 bpf: add hashtable type of eBPF maps
add new map type BPF_MAP_TYPE_HASH and its implementation

- maps are created/destroyed by userspace. Both userspace and eBPF programs
  can lookup/update/delete elements from the map

- eBPF programs can be called in_irq(), so use spin_lock_irqsave() mechanism
  for concurrent updates

- key/value are opaque range of bytes (aligned to 8 bytes)

- user space provides 3 configuration attributes via BPF syscall:
  key_size, value_size, max_entries

- map takes care of allocating/freeing key/value pairs

- map_update_elem() must fail to insert new element when max_entries
  limit is reached to make sure that eBPF programs cannot exhaust memory

- map_update_elem() replaces elements in an atomic way

- optimized for speed of lookup() which can be called multiple times from
  eBPF program which itself is triggered by high volume of events
  . in the future JIT compiler may recognize lookup() call and optimize it
    further, since key_size is constant for life of eBPF program

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:43:25 -05:00
Alexei Starovoitov 3274f52073 bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
either update existing map element or create a new one.
Initially the plan was to add a new command to handle the case of
'create new element if it didn't exist', but 'flags' style looks
cleaner and overall diff is much smaller (more code reused), so add 'flags'
attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
 #define BPF_ANY	0 /* create new element or update existing */
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */

bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
if element already exists.

bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT
if element doesn't exist.

Userspace will call it as:
int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
        .flags = flags;
    };

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

First two bits of 'flags' are used to encode style of bpf_update_elem() command.
Bits 2-63 are reserved for future use.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 13:43:25 -05:00
Tejun Heo eeecbd1971 cgroup: implement cgroup_get_e_css()
Implement cgroup_get_e_css() which finds and gets the effective css
for the specified cgroup and subsystem combination.  This function
always returns a valid pinned css.  This will be used by cgroup
writeback support.

While at it, add comment to cgroup_e_css() to explain why that
function is different from cgroup_get_e_css() and has to test
cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss).

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18 02:49:52 -05:00
Tejun Heo 56c807ba4e cgroup: add cgroup_subsys->css_e_css_changed()
Add a new cgroup_subsys operatoin ->css_e_css_changed().  This is
invoked if any of the effective csses seen from the css's cgroup may
have changed.  This will be used to implement cgroup writeback
support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18 02:49:51 -05:00
Tejun Heo 7d172cc89b cgroup: add cgroup_subsys->css_released()
Add a new cgroup subsys callback css_released().  This is called when
the reference count of the css (cgroup_subsys_state) reaches zero
before RCU scheduling free.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18 02:49:51 -05:00
Tejun Heo db6e305345 cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
When a subsystem is offlined, its entry on @cgrp->subsys[] is cleared
asynchronously.  If cgroup_subtree_control_write() is requested to
enable the subsystem again before the entry is cleared, it has to wait
for the previous offlining to finish and clear the @cgrp->subsys[]
entry before trying to enable the subsystem again.

This is currently done while verifying the input enable / disable
parameters.  This used to be correct but f63070d350 ("cgroup: make
interface files visible iff enabled on cgroup->subtree_control")
breaks it.  The commit is one of the commits implementing subsystem
dependency.

Through subsystem dependency, some subsystems may be enabled and
disabled implicitly in addition to the explicitly requested ones.  The
actual subsystems to be enabled and disabled are determined during
@css_enable/disable calculation.  The current offline wait logic skips
the ones which are already implicitly enabled and then waits for
subsystems in @enable; however, this misses the subsystems which may
be implicitly enabled through dependency from @enable.  If such
implicitly subsystem hasn't yet finished offlining yet, the function
ends up trying to create a css when its @cgrp->subsys[] slot is
already occupied triggering BUG_ON() in init_and_link_css().

Fix it by moving the wait logic after @css_enable is calculated and
waiting for all the subsystems in @css_enable.  This fixes the above
bug as the mask contains all subsystems which are to be enabled
including the ones enabled through dependencies.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f63070d350 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control")
Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18 02:49:51 -05:00
Tejun Heo 755bf5ee86 cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
Make cgroup_subtree_control_write() first calculate new
subtree_control (new_sc), child_subsys_mask (new_ss) and
css_enable/disable masks before applying them to the cgroup.  Also,
store the original subtree_control (old_sc) and child_subsys_mask
(old_ss) and use them to restore the orignal state after failure.

This patch shouldn't cause any behavior changes.  This prepares for a
fix for a bug in the async css offline wait logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18 02:49:50 -05:00
Tejun Heo 0f060deb5c cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
cgroup_refresh_child_subsys_mask() calculates and updates the
effective @cgrp->child_subsys_maks according to the current
@cgrp->subtree_control.  Separate out the calculation part into
cgroup_calc_child_subsys_mask().  This will be used to fix a bug in
the async css offline wait logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2014-11-18 02:49:50 -05:00
Pankaj Dubey 7d3dcd042c PM: Kconfig: fix unmet dependency for CPU_PM
If BL_SWITCHER is enabled but SUSPEND and CPU_IDLE is not enabled
we are getting following config warning.

warning: (BL_SWITCHER) selects CPU_PM which has unmet direct
dependencies (SUSPEND || CPU_IDLE)

It has been noticed that CPU_PM dependencies in this file are not really
required so let's remove these dependencies from CPU_PM.

Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-18 02:51:17 +01:00
Markus Elfring 6c45de0d51 PM / hibernate: Deletion of an unnecessary check before the function call "vfree"
The vfree() function performs also input parameter validation. Thus the test
around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-18 02:49:40 +01:00
Rafael J. Wysocki c68cfdc0a5 Merge back 'pm-sleep' material for 3.19-rc1. 2014-11-18 01:23:34 +01:00
Dave Hansen fe3d197f84 x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set.  If there is one patch to
review in the entire series, this is the one.  There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner.  (small FAQ below).

Long Description:

This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers.  The prctl() is an
explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.

PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address.  Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.

Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive.  Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.

==== Why does the hardware even have these Bounds Tables? ====

MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so
   that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds
   directory. If we were to preallocate them for the 128TB of
   user virtual address space, we would need to reserve 512TB+2GB,
   which is larger than the entire virtual address space today.
   This means they can not be reserved ahead of time. Also, a
   single process's pre-popualated bounds directory consumes 2GB
   of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory
   is allocated which might contain pointers that might eventually
   need bounds tables?
A: This would work if we could hook the site of each and every
   memory allocation syscall. This can be done for small,
   constrained applications. But, it isn't practical at a larger
   scale since a given app has no way of controlling how all the
   parts of the app might allocate memory (think libraries). The
   kernel is really the only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
   handler functions and even if mmap() would work it still
   requires locking or nasty tricks to keep track of the
   allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.

Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18 00:58:53 +01:00
Qiaowei Ren ee1b58d36a mpx: Extend siginfo structure to include bound violation information
This patch adds new fields about bound violation into siginfo
structure. si_lower and si_upper are respectively lower bound
and upper bound when bound violation is caused.

Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151819.1908C900@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18 00:58:53 +01:00
Richard Guy Briggs 0288d7183c audit: convert status version to a feature bitmap
The version field defined in the audit status structure was found to have
limitations in terms of its expressibility of features supported.  This is
distict from the get/set features call to be able to command those features
that are present.

Converting this field from a version number to a feature bitmap will allow
distributions to selectively backport and support certain features and will
allow upstream to be able to deprecate features in the future.  It will allow
userspace clients to first query the kernel for which features are actually
present and supported.  Currently, EINVAL is returned rather than EOPNOTSUP,
which isn't helpful in determining if there was an error in the command, or if
it simply isn't supported yet.  Past features are not represented by this
bitmap, but their use may be converted to EOPNOTSUP if needed in the future.

Since "version" is too generic to convert with a #define, use a union in the
struct status, introducing the member "feature_bitmap" unionized with
"version".

Convert existing AUDIT_VERSION_* macros over to AUDIT_FEATURE_BITMAP*
counterparts, leaving the former for backwards compatibility.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: minor whitespace tweaks]
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-11-17 16:53:51 -05:00
Peter Zijlstra 2565711fb7 perf: Improve the perf_sample_data struct layout
This patch reorders fields in the perf_sample_data struct in order to
minimize the number of cachelines touched in perf_sample_data_init().
It also removes some intializations which are redundant with the code
in kernel/events/core.c

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1411559322-16548-7-git-send-email-eranian@google.com
Cc: cebbert.lkml@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: jolsa@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 11:42:04 +01:00
Stephane Eranian 60e2364e60 perf: Add ability to sample machine state on interrupt
Enable capture of interrupted machine state for each sample.

Registers to sample are passed per event in the sample_regs_intr bitmask.

To sample interrupt machine state, the PERF_SAMPLE_INTR_REGS must be passed in
sample_type.

The list of available registers is arch dependent and provided by asm/perf_regs.h

Registers are laid out as u64 in the order of the bit order of sample_intr_regs.

This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend
the perf_event_attr struct with a new u64 field.

Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: cebbert.lkml@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 11:41:57 +01:00
Wanpeng Li 36ce98818a sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with
the fair class.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:59:03 +01:00
pang.xunlei c1a2b5f629 sched/deadline: Remove unnecessary definitions in cpudeadline.h
Actually, cpudl_set() and cpudl_init() can never be used without
CONFIG_SMP.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415260327-30465-4-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:59:00 +01:00
pang.xunlei 74e6942fbc sched/cpupri: Remove unnecessary definitions in cpupri.h
Actually, cpupri_set() and cpupri_init() can never be used without
CONFIG_SMP.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:59 +01:00
Wanpeng Li c51b8ab5ad sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
Do not call dequeue_pushable_dl_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.

Actually the patch is the same behavior as commit 311e800e16 ("sched,
rt: Fix rq->rt.pushable_tasks bug in push_rt_task()" in -rt side.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:57 +01:00
Wanpeng Li cb0b9f2445 sched/fair: Fix stale overloaded status in the busiest group finding logic
Commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded > imbalance > other, and busiest group is picked according
to this order.

sgs->group_capacity_factor is used to check if the group is overloaded.

When the child domain prefers tasks to go to siblings first, the
sgs->group_capacity_factor will be set lower than one in order to
move all the excess tasks away.

However, group overloaded status is not updated when
sgs->group_capacity_factor is set to lower than one, which leads to us
missing to find the busiest group.

This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:56 +01:00
Wanpeng Li 6c1d9410f0 sched: Move p->nr_cpus_allowed check to select_task_rq()
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:55 +01:00
Wolfram Sang a1bd537335 sched/completion: Document when to use wait_for_completion_io_*()
As discussed in [1], accounting IO is meant for blkio only. Document that
so driver authors won't use them for device io.

 [1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470

Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:54 +01:00
Kirill Tkhai 753899183c sched/fair: Kill task_struct::numa_entry and numa_group::task_list
Nobody iterates over numa_group::task_list, this just confuses the readers.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:48 +01:00
Ingo Molnar e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00
Stanislaw Gruszka 6e998916df sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
Commit d670ec1317 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
of Y time being smaller than X time.

Reproducer/tester can be found further below, it can be compiled and ran by:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <stdio.h>
	#include <time.h>
	#include <pthread.h>
	#include <stdint.h>
	#include <inttypes.h>

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&barrier, NULL, 2);

		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&sleeptimeabs, &after);
		if (sleepdiff < 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&before, &after);
		if (diffabs < sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:20 +01:00
Peter Zijlstra 23cfa361f3 sched/cputime: Fix cpu_timer_sample_group() double accounting
While looking over the cpu-timer code I found that we appear to add
the delta for the calling task twice, through:

  cpu_timer_sample_group()
    thread_group_cputimer()
      thread_group_cputime()
        times->sum_exec_runtime += task_sched_runtime();

    *sample = cputime.sum_exec_runtime + task_delta_exec();

Which would make the sample run ahead, making the sleep short.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:18 +01:00
Peter Zijlstra 7af683350c sched/numa: Avoid selecting oneself as swap target
Because the whole numa task selection stuff runs with preemption
enabled (its long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:17 +01:00
Mark Rutland 226424eee8 perf: Fix corruption of sibling list with hotplug
When a CPU hotplugged out, we call perf_remove_from_context() (via
perf_event_exit_cpu()) to rip each CPU-bound event out of its PMU's cpu
context, but leave siblings grouped together. Freeing of these events is
left to the mercy of the usual refcounting.

When a CPU-bound event's refcount drops to zero we cross-call to
__perf_remove_from_context() to clean it up, detaching grouped siblings.

This works when the relevant CPU is online, but will fail if the CPU is
currently offline, and we won't detach the event from its siblings
before freeing the event, leaving the sibling list corrupt. If the
sibling list is later walked (e.g. because the CPU cam online again
before a remaining sibling's refcount drops to zero), we will walk the
now corrupted siblings list, potentially dereferencing garbage values.

Given that the events should never be scheduled again (as we removed
them from their context), we can simply detatch siblings when the CPU
goes down in the first place. If the CPU comes back online, the
redundant call to __perf_remove_from_context() is safe.

Reported-by: Drew Richardson <drew.richardson@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: vincent.weaver@maine.edu
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415203904-25308-2-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 09:45:46 +01:00
Kevin Hilman 977d2fa6b2 PM / Runtime: Kconfig: move ia64 dependency to arch/ia64/Kconfig
The IA64_HP_SIM dependency on PM_RUNTIME should be done in the arch
Kconfig instead of in the PM core.  Move it accordingly.

NOTE: arch/ia64/Kconfig currently does a 'select PM', which since
commit 1eb208aea3 (PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP ||
CONFIG_PM_RUNTIME)) is effectively a noop unless PM_SLEEP or
PM_RUNTIME are set elsewhere.

Signed-off-by: Kevin Hilman <khilman@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-11-15 00:39:56 +01:00
Linus Torvalds 78646f62db ACPI and power management fixes for 3.18-rc5
- Fix a crash in the suspend-to-idle code path introduced by a
    recent commit that forgot to check a pointer against NULL before
    dereferencing it (Dmitry Eremin-Solenikov).
 
  - Fix a boot crash on Exynos5 introduced by a recent commit making
    that platform use generic Device Tree bindings for power domains
    which exposed a weakness in the generic power domains framework
    leading to that crash (Ulf Hansson).
 
  - Fix a crash during system resume on systems where cpufreq depends
    on Operation Performance Points (OPP) for functionality, but
    CONFIG_OPP is not set.  This leads the cpufreq driver registration
    to fail, but the resume code attempts to restore the pre-suspend
    cpufreq configuration (which does not exist) nevertheless and
    crashes.  From Geert Uytterhoeven.
 
  - Add a new ACPI blacklist entry for Dell Vostro 3546 that has
    problems if it is reported as Windows 8 compatible to the BIOS
    (Adam Lee).
 
  - Fix swapped arguments in an error message in the cpufreq-dt
    driver (Abhilash Kesavan).
 
  - Fix up the prototypes of new callbacks in struct generic_pm_domain
    to make them more useful.  Users of those callbacks will be added
    in 3.19 and it's better for them to be based on the correct struct
    definition in mainline from the start.  From Ulf Hansson and
    Kevin Hilman.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJUZhtnAAoJEILEb/54YlRxx1EP/0Rk7pJUHeOMmdyXyY7B+n+f
 MlXHVMDhskT370fsdTGbpeYb5ATr5kGatfhr+vyDQmBtxdw7lDJxKq54s6kmmIL3
 SEMRRb4NtkPsdDE7zq985JmjsrnHtKxC5NjSUwEGxdyyfAZxll4mrZL6RrqXCu44
 L+qdVXRffCCrJDXZl5FZUpSZ3ZUc+xTiaDy7ObjLe2bwmzvBOAwS2flBMKxN9X+e
 khlGdQZ0e9T2Y3IXriHxHMui8OVbkPyYZkW1aubCd0HwuTMP7sebosX/2JWdJOmg
 q6bGcvPlBwXDRoShlzFO8CN5w5E8fIe0vfPcg9SB3s21S7rJEbYQX/5ytm107aJj
 Ysv7mcb2dAHG0V3J7hxhkS+7UNPxfk3G+8frxW2UQ6eIDlZkBORIUhGCzeSbIGYM
 aIKiomN4jGuPeaOkEnKl4RwMlzjuzAs2V06viffbq63eyWBvtHDW8M5bdq901pXp
 1jOT7yKqLzOZYqcYaLr3z+IBw/+hfuG/FdCp3uGyFqeHPBNIP3BfFnWm6A6E13b+
 aC6gvhQHojT7L2gqIBJ+Qn0EiRWNqwoLk6w6DLDYJna/hYyoXq0BKv+/x2OegItU
 ENKYVpfmSt3YsEhcTBW4h5IpUvK07o5Oa3nTxen6924Im61dMyaSUDD5DiaqCgXO
 bVJTsF983hBZGTy0IMX/
 =wQxT
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes from Rafael Wysocki:
 "These are three regression fixes, two recent (generic power domains,
  suspend-to-idle) and one older (cpufreq), an ACPI blacklist entry for
  one more machine having problems with Windows 8 compatibility, a minor
  cpufreq driver fix (cpufreq-dt) and a fixup for new callback
  definitions (generic power domains).

  Specifics:

   - Fix a crash in the suspend-to-idle code path introduced by a recent
     commit that forgot to check a pointer against NULL before
     dereferencing it (Dmitry Eremin-Solenikov).

   - Fix a boot crash on Exynos5 introduced by a recent commit making
     that platform use generic Device Tree bindings for power domains
     which exposed a weakness in the generic power domains framework
     leading to that crash (Ulf Hansson).

   - Fix a crash during system resume on systems where cpufreq depends
     on Operation Performance Points (OPP) for functionality, but
     CONFIG_OPP is not set.  This leads the cpufreq driver registration
     to fail, but the resume code attempts to restore the pre-suspend
     cpufreq configuration (which does not exist) nevertheless and
     crashes.  From Geert Uytterhoeven.

   - Add a new ACPI blacklist entry for Dell Vostro 3546 that has
     problems if it is reported as Windows 8 compatible to the BIOS
     (Adam Lee).

   - Fix swapped arguments in an error message in the cpufreq-dt driver
     (Abhilash Kesavan).

   - Fix up the prototypes of new callbacks in struct generic_pm_domain
     to make them more useful.  Users of those callbacks will be added
     in 3.19 and it's better for them to be based on the correct struct
     definition in mainline from the start.  From Ulf Hansson and Kevin
     Hilman"

* tag 'pm+acpi-3.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Domains: Fix initial default state of the need_restore flag
  PM / sleep: Fix entering suspend-to-IDLE if no freeze_oops is set
  PM / Domains: Change prototype for the attach and detach callbacks
  cpufreq: Avoid crash in resume on SMP without OPP
  cpufreq: cpufreq-dt: Fix arguments in clock failure error message
  ACPI / blacklist: blacklist Win8 OSI for Dell Vostro 3546
2014-11-14 13:38:02 -08:00
Byungchul Park 4526d0676a function_graph: Fix micro seconds notations
Usually, "msecs" notation means milli-seconds, and "usecs" notation
means micro-seconds. Since the unit used in the code is micro-seconds,
the notation should be replaced from msecs to usecs.

Link: http://lkml.kernel.org/r/1415171926-9782-2-git-send-email-byungchul.park@lge.com

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-14 07:56:03 -05:00
Daniel Bristot de Oliveira 678f845ed0 ftrace-graph: show latency-format on print_graph_irq()
On the function_graph tracer, the print_graph_irq() function prints a
trace line with the flag ==========> on an irq handler entry, and the
flag <========== on an irq handler return.

But when the latency-format is enable, it is not printing the
latency-format flags, causing the following error in the trace output:

 0)   ==========> |
 0)  d...              |  smp_apic_timer_interrupt() {

This patch fixes this issue by printing the latency-format flags when
it is enable.

Link: http://lkml.kernel.org/r/7c2e226dac20c940b6242178fab7f0e3c9b5ce58.1415233316.git.bristot@redhat.com

Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-14 07:56:02 -05:00
Rasmus Villemoes 1177e43641 trace: Replace single-character seq_puts with seq_putc
Printing a single character to a seqfile might as well be done with
seq_putc instead of seq_puts; this avoids a strlen() call and a memory
access. It also shaves another few bytes off the generated code.

Link: http://lkml.kernel.org/r/1415479332-25944-4-git-send-email-linux@rasmusvillemoes.dk

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-14 07:55:55 -05:00
David S. Miller 076ce44825 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/chelsio/cxgb4vf/sge.c
	drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c

sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.

ixgbe_phy.c was a set of overlapping whitespace changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 01:01:12 -05:00
Rasmus Villemoes d79ac28fde tracing: Merge consecutive seq_puts calls
Consecutive seq_puts calls with literal strings can be merged to a
single call. This reduces the size of the generated code, and can also
lead to slight .rodata reduction (because of fewer nul and padding
bytes). It should also shave a off a few clock cycles.

Link: http://lkml.kernel.org/r/1415479332-25944-3-git-send-email-linux@rasmusvillemoes.dk

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-13 21:33:34 -05:00
Rasmus Villemoes fa6f0cc751 tracing: Replace seq_printf by simpler equivalents
Using seq_printf to print a simple string or a single character is a
lot more expensive than it needs to be, since seq_puts and seq_putc
exist.

These patches do

  seq_printf(m, s) -> seq_puts(m, s)
  seq_printf(m, "%s", s) -> seq_puts(m, s)
  seq_printf(m, "%c", c) -> seq_putc(m, c)

Subsequent patches will simplify further.

Link: http://lkml.kernel.org/r/1415479332-25944-2-git-send-email-linux@rasmusvillemoes.dk

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-13 21:32:19 -05:00
Daniel Thompson 8520dedbbf tracing: kdb: Fix kernel livelock with empty buffers
Currently kdb's ftdump command will livelock by constantly printk'ing
the empty string at KERN_EMERG level if it run when the ftrace system is
not in use. This occurs because trace_empty() never returns false when
the ring buffers are left at the start of a non-consuming read [launched
by ring_buffer_read_start()].

This patch changes the loop exit condition to use the result of
trace_find_next_entry_inc(). Effectively this switches the non-consuming
kdb dumper to follow the approach of the non-consuming userspace
interface [s_next()] rather than the consuming ftrace_dump().

Link: http://lkml.kernel.org/r/1415277716-19419-3-git-send-email-daniel.thompson@linaro.org

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-13 21:27:25 -05:00
Daniel Thompson c270cc75cd tracing: kdb: Fix kernel panic during ftdump
Currently kdb's ftdump command unconditionally crashes due to a null
pointer de-reference whenever the command is run. This in turn causes
the kernel to panic.

The abridged stacktrace (gathered with ARCH=arm) is:

--- cut here ---
[<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440)
[<c02132dc>] (die) from [<c0952eb8>]
(__do_kernel_fault.part.11+0x74/0x84)
[<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>]
(do_page_fault+0x1d0/0x3c4)
[<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac)

[<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60)
Exception stack(0xc0deba88 to 0xc0debad0)
ba80:                   e8c29180 00000001 e9854304 e9854300 c0f567d8
c0df2580
baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000
c0debad0
bac0: 0000672e c02f4d60 60000193 ffffffff
[<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8)
[<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698)
[<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784)
[<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490)
--- cut here ---

The NULL deref occurs due to the initialized use of struct trace_iter's
buffer_iter member.

This is a regression, albeit a fairly elderly one. It was introduced
by commit 6d158a813e ("tracing: Remove NR_CPUS array from
trace_iterator").

This patch solves this by providing a collection of ring_buffer_iter(s)
and using this to initialize buffer_iter. Note that static allocation
is used solely because the trace_iter itself is also static allocated.
Static allocation also means that we have to NULL-ify the pointer during
cleanup to avoid use-after-free problems.

Link: http://lkml.kernel.org/r/1415277716-19419-2-git-send-email-daniel.thompson@linaro.org

Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-13 21:24:24 -05:00
Luis Claudio R. Goncalves 933ff9f202 tracing: Fix traceoff_on_warning handling on boot command line
According to the documentation, adding "traceoff_on_warning" to the boot
command line should be enough to enable the feature. But right now it is
necessary to specify "traceoff_on_warning=". Along with fixing that, also
verify if the value passed, if any, is either "0" or "off".

Link: http://lkml.kernel.org/r/20141112231400.GL12281@uudg.org

Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-13 21:03:41 -05:00
Steven Rostedt (Red Hat) fe578ba36f ftrace: Have the control_ops get a trampoline
With the new logic, if only a single user of ftrace function hooks is
used, it will get its own trampoline assigned to it.

The problem is that the control_ops is an indirect ops that perf ops
uses. What that means is that when perf registers its ops with
register_ftrace_function(), it has the CONTROL flag set and gets added
to the control list instead of the global ftrace list. The control_ops
gets added to that instead and the mcount trampoline calls the control_ops
function. The control_ops function will iterate the control list and
call the ops functions that are attached to it.

But currently the trampoline is added to the perf ops and not the
control ops, and when ftrace tries to find a trampoline hook for it,
it fails to find one and gives the following splat:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 10133 at kernel/trace/ftrace.c:2033 ftrace_get_addr_new+0x6f/0xc0()
 Modules linked in: [...]
 CPU: 0 PID: 10133 Comm: perf Tainted: P               3.18.0-rc1-test+ #388
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
  00000000000007f1 ffff8800c2643bc8 ffffffff814fca6e ffff88011ea0ed01
  0000000000000000 ffff8800c2643c08 ffffffff81041ffd 0000000000000000
  ffffffff810c388c ffffffff81a5a350 ffff880119b00000 ffffffff810001c8
 Call Trace:
  [<ffffffff814fca6e>] dump_stack+0x46/0x58
  [<ffffffff81041ffd>] warn_slowpath_common+0x81/0x9b
  [<ffffffff810c388c>] ? ftrace_get_addr_new+0x6f/0xc0
  [<ffffffff810001c8>] ? 0xffffffff810001c8
  [<ffffffff81042031>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff810c388c>] ftrace_get_addr_new+0x6f/0xc0
  [<ffffffff8102e938>] ftrace_replace_code+0xd6/0x334
  [<ffffffff810c4116>] ftrace_modify_all_code+0x41/0xc5
  [<ffffffff8102eba6>] arch_ftrace_update_code+0x10/0x19
  [<ffffffff810c293c>] ftrace_run_update_code+0x21/0x42
  [<ffffffff810c298f>] ftrace_startup_enable+0x32/0x34
  [<ffffffff810c3049>] ftrace_startup+0x14e/0x15a
  [<ffffffff810c307c>] register_ftrace_function+0x27/0x40
  [<ffffffff810dc118>] perf_ftrace_event_register+0x3e/0xee
  [<ffffffff810dbfbe>] perf_trace_init+0x29d/0x2a9
  [<ffffffff810eb422>] perf_tp_event_init+0x27/0x3a
  [<ffffffff810f18bc>] perf_init_event+0x9e/0xed
  [<ffffffff810f1ba4>] perf_event_alloc+0x299/0x330
  [<ffffffff810f236b>] SYSC_perf_event_open+0x3ee/0x816
  [<ffffffff8115a066>] ? mntput+0x2d/0x2f
  [<ffffffff81142b00>] ? __fput+0xa7/0x1b2
  [<ffffffff81091300>] ? do_gettimeofday+0x22/0x3a
  [<ffffffff810f279c>] SyS_perf_event_open+0x9/0xb
  [<ffffffff81502a92>] system_call_fastpath+0x12/0x17
 ---[ end trace 81a53565150e4982 ]---
 Bad trampoline accounting at: ffffffff810001c8 (run_init_process+0x0/0x2d) (10000001)

Update the control_ops trampoline instead of the perf ops one.

Reported-by: lkp@01.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-13 19:40:56 -05:00
Xie XiuQi bc53a3f46d kernel/panic.c: update comments for print_tainted
Commit 69361eef90 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
but failed to update the comments for print_tainted().  So, update the
comments.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13 16:17:06 -08:00
Paul E. McKenney 9ea6c58856 Merge branches 'torture.2014.11.03a', 'cpu.2014.11.03a', 'doc.2014.11.13a', 'fixes.2014.11.13a', 'signal.2014.10.29a' and 'rt.2014.10.29a' into HEAD
cpu.2014.11.03a: Changes for per-CPU variables.
doc.2014.11.13a: Documentation updates.
fixes.2014.11.13a: Miscellaneous fixes.
signal.2014.10.29a: Signal changes.
rt.2014.10.29a: Real-time changes.
torture.2014.11.03a: torture-test changes.
2014-11-13 10:39:04 -08:00
Paul E. McKenney 60ced4950c rcu: Fix FIXME in rcu_tasks_kthread()
This commit affines rcu_tasks_kthread() to the housekeeping CPUs
in CONFIG_NO_HZ_FULL builds.  This is just a default, so systems
administrators are free to put this kthread somewhere else if they wish.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-11-13 10:35:41 -08:00
Linus Torvalds 911883759f Merge branch 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "After he sent the initial audit pull request for 3.18, Eric asked me
  to take over the management of the audit tree, hence this pull request
  to fix a couple of problems with audit.

  As you can see below, the changes are minimal: adding some whitespace
  to a string so userspace parses it correctly, and fixing a problem
  with audit's usage of fsnotify that was causing audit watch rules to
  be lost.  Neither of these patches were very controversial on the
  mailing lists and they fix real problems, getting them into 3.18 would
  be a good thing"

* 'stable-3.18' of git://git.infradead.org/users/pcmoore/audit:
  audit: keep inode pinned
  audit: AUDIT_FEATURE_CHANGE message format missing delimiting space
2014-11-13 09:36:39 -08:00
Miklos Szeredi 799b601451 audit: keep inode pinned
Audit rules disappear when an inode they watch is evicted from the cache.
This is likely not what we want.

The guilty commit is "fsnotify: allow marks to not pin inodes in core",
which didn't take into account that audit_tree adds watches with a zero
mask.

Adding any mask should fix this.

Fixes: 90b1e7a578 ("fsnotify: allow marks to not pin inodes in core")
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org # 2.6.36+
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-11-11 14:20:22 -05:00
Jiang Liu 26488b3723 tracing: Add entry->next_cpu to trace_ctxwake_bin()
Function trace_ctxwake_bin() misses ctx_switch_entry->next_cpu field,
so user will get stale value for "next_cpu".

Link: http://lkml.kernel.org/p/1377176379-27908-1-git-send-email-liuj97@gmail.com

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-11 12:43:34 -05:00
Steven Rostedt (Red Hat) 243f7610a6 tracing: Move tracing_sched_{switch,wakeup}() into wakeup tracer
The only code that references tracing_sched_switch_trace() and
tracing_sched_wakeup_trace() is the wakeup latency tracer. Those
two functions use to belong to the sched_switch tracer which has
long been removed. These functions were left behind because the
wakeup latency tracer used them. But since the wakeup latency tracer
is the only one to use them, they should be static functions inside
that code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-11 12:43:15 -05:00
Oleg Nesterov 458faf0b88 tracing: Kill the dead code in probe_sched_switch() and probe_sched_wakeup()
After the previous patch it is clear that "tracer_enabled" can never be
true, we can remove the "if (tracer_enabled)" code in probe_sched_switch()
and probe_sched_wakeup(). Plus we can obviously remove tracer_enabled,
ctx_trace, and sched_stopped as well.

Link: http://lkml.kernel.org/p/20140723193503.GA30217@redhat.com

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-11 12:42:44 -05:00
Oleg Nesterov 632537256e tracing: Kill tracing_{start,stop}_sched_switch_record() and tracing_sched_switch_assign_trace()
tracing_{start,stop}_sched_switch_record() have no callers since
87d80de280 "tracing: Remove obsolete sched_switch tracer".

The last caller of tracing_sched_switch_assign_trace() was removed
by 30dbb20e68 "tracing: Remove boot tracer".

Link: http://lkml.kernel.org/p/20140723193501.GA30214@redhat.com

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-11 12:42:23 -05:00
Steven Rostedt (Red Hat) 4fd3279b48 ftrace: Add more information to ftrace_bug() output
With the introduction of the dynamic trampolines, it is useful that if
things go wrong that ftrace_bug() produces more information about what
the current state is. This can help debug issues that may arise.

Ftrace has lots of checks to make sure that the state of the system it
touchs is exactly what it expects it to be. When it detects an abnormality
it calls ftrace_bug() and disables itself to prevent any further damage.
It is crucial that ftrace_bug() produces sufficient information that
can be used to debug the situation.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-11 12:42:13 -05:00
Steven Rostedt (Red Hat) 12cce594fa ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines
When the static ftrace_ops (like function tracer) enables tracing, and it
is the only callback that is referencing a function, a trampoline is
dynamically allocated to the function that calls the callback directly
instead of calling a loop function that iterates over all the registered
ftrace ops (if more than one ops is registered).

But when it comes to dynamically allocated ftrace_ops, where they may be
freed, on a CONFIG_PREEMPT kernel there's no way to know when it is safe
to free the trampoline. If a task was preempted while executing on the
trampoline, there's currently no way to know when it will be off that
trampoline.

But this is not true when it comes to !CONFIG_PREEMPT. The current method
of calling schedule_on_each_cpu() will force tasks off the trampoline,
becaues they can not schedule while on it (kernel preemption is not
configured). That means it is safe to free a dynamically allocated
ftrace ops trampoline when CONFIG_PREEMPT is not configured.

Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-11 12:41:52 -05:00
Rusty Russell 18eb74fa94 params: cleanup sysfs allocation
commit 63662139e5 attempted to patch a
leak (which would only happen on OOM, ie. never), but it didn't quite
work.

This rewrites the code to be as simple as possible.  add_sysfs_param()
adds a parameter.  If it fails, it's the caller's responsibility to
clean up the parameters which already exist.

The kzalloc-then-always-krealloc pattern is perhaps overly simplistic,
but this code has clearly confused people.  It worked on me...

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-11-11 17:07:47 +10:30
Ionut Alexa 6da0b56515 kernel:module Fix coding style errors and warnings.
Fixed codin style errors and warnings. Changes printk with
print_debug/warn. Changed seq_printf to seq_puts.

Signed-off-by: Ionut Alexa <ionut.m.alexa@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed bogus KERN_DEFAULT conversion)
2014-11-11 17:07:47 +10:30