Merge batch of fixes from Andrew Morton:
"The simple_open() cleanup was held back while I wanted for laggards to
merge things.
I still need to send a few checkpoint/restore patches. I've been
wobbly about merging them because I'm wobbly about the overall
prospects for success of the project. But after speaking with Pavel
at the LSF conference, it sounds like they're further toward
completion than I feared - apparently davem is at the "has stopped
complaining" stage regarding the net changes. So I need to go back
and re-review those patchs and their (lengthy) discussion."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (16 patches)
memcg swap: use mem_cgroup_uncharge_swap fix
backlight: add driver for DA9052/53 PMIC v1
C6X: use set_current_blocked() and block_sigmask()
MAINTAINERS: add entry for sparse checker
MAINTAINERS: fix REMOTEPROC F: typo
alpha: use set_current_blocked() and block_sigmask()
simple_open: automatically convert to simple_open()
scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
libfs: add simple_open()
hugetlbfs: remove unregister_filesystem() when initializing module
drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
fs/xattr.c:setxattr(): improve handling of allocation failures
fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
sysrq: use SEND_SIG_FORCED instead of force_sig()
proc: fix mount -t proc -o AAA
Many users of debugfs copy the implementation of default_open() when
they want to support a custom read/write function op. This leads to a
proliferation of the default_open() implementation across the entire
tree.
Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().
This replacement was done with the following semantic patch:
<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}
@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading the trace file, the records of each of the per_cpu buffers
are examined to find the next event to print out. At the point of looking
at the event, the size of the event is recorded. But if the first event is
chosen, the other events in the other CPU buffers will reset the event size
that is stored in the iterator descriptor, causing the event size passed to
the output functions to be incorrect.
In most cases this is not a problem, but for the case of stack traces, it
is. With the change to the stack tracing to record a dynamic number of
back traces, the output depends on the size of the entry instead of the
fixed 8 back traces. When the entry size is not correct, the back traces
would not be fully printed.
Note, reading from the per-cpu trace files were not affected.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
Today's -next fails to build for me:
CC kernel/trace/trace_export.o
In file included from kernel/trace/trace_export.c:197: kernel/trace/trace_entries.h:58: error: 'perf_ftrace_event_register' undeclared here (not in a function)
make[2]: *** [kernel/trace/trace_export.o] Error 1
make[1]: *** [kernel/trace] Error 2
make: *** [kernel] Error 2
because as of ced390 (ftrace, perf: Add support to use function
tracepoint in perf) perf_trace_event_register() is declared in trace.h
only if CONFIG_PERF_EVENTS is enabled but I don't have that set.
Ensure that we always have a definition of perf_trace_event_register()
by making the definition unconditional.
Link: http://lkml.kernel.org/r/1330426967-17067-1-git-send-email-broonie@opensource.wolfsonmicro.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When CONFIG_DYNAMIC_FTRACE is not set, some archs (ARM) test
the variable function_trace_function to determine if it should
call the function tracer. If it is not set to ftrace_stub, then
it will call the function and return, and not call the function
graph tracer.
But some of these archs (ARM) do not have the assembly code
to test if function tracing is enabled or not (quick stop of tracing)
and it calls the helper routine ftrace_test_stop_func() instead.
If function tracer is enabled and then disabled, the variable
ftrace_trace_function is still set to the helper routine
ftrace_test_stop_func(), and not to ftrace_stub. This will
prevent the function graph tracer from ever running.
Output before patch
/debug/tracing # echo function > current_tracer
/debug/tracing # echo function_graph > current_tracer
/debug/tracing # cat trace
Output after patch
/debug/tracing # echo function > current_tracer
/debug/tracing # echo function_graph > current_tracer
/debug/tracing # cat trace
0) ! 253.375 us | } /* irq_enter */
0) | generic_handle_irq() {
0) | handle_fasteoi_irq() {
0) 9.208 us | _raw_spin_lock();
0) | handle_irq_event() {
0) | handle_irq_event_percpu() {
Signed-off-by: Rajesh Bhagat <rajesh.lnx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_dump() (called by ftrace_dump_on_oops) disables interrupts
as it dumps its output to the console, it can keep interrupts disabled
for long periods of time. This is likely to trigger the NMI watchdog,
and it can disrupt the output of critical data.
Add a touch_nmi_watchdog() to each event that is written to the screen
to keep the NMI watchdog from affecting the output.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On PowerPC, FUNCTION_TRACER selects FRAME_POINTER, even
though the architecture does not support it.
This causes the following warning:
warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP && FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS)
So remove the warning by adding the extra condition
"if !PPC" to FUNCTION_TRACER for FRAME_POINTER selection
Link: http://lkml.kernel.org/r/1330330101-8618-1-git-send-email-gerlando.falauto@keymile.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the ring-buffer code is being used by other facilities in the
kernel, having tracing_on file disable *all* buffers is not a desired
affect. It should only disable the ftrace buffers that are being used.
Move the code into the trace.c file and use the buffer disabling
for tracing_on() and tracing_off(). This way only the ftrace buffers
will be affected by them and other kernel utilities will not be
confused to why their output suddenly stopped.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding support to filter function trace event via perf
interface. It is now possible to use filter interface
in the perf tool like:
perf record -e ftrace:function --filter="(ip == mm_*)" ls
The filter syntax is restricted to the the 'ip' field only,
and following operators are accepted '==' '!=' '||', ending
up with the filter strings like:
ip == f1[, ]f2 ... || ip != f3[, ]f4 ...
with comma ',' or space ' ' as a function separator. If the
space ' ' is used as a separator, the right side of the
assignment needs to be enclosed in double quotes '"', e.g.:
perf record -e ftrace:function --filter '(ip == do_execve,sys_*,ext*)' ls
perf record -e ftrace:function --filter '(ip == "do_execve,sys_*,ext*")' ls
perf record -e ftrace:function --filter '(ip == "do_execve sys_* ext*")' ls
The '==' operator adds trace filter with same effect as would
be added via set_ftrace_filter file.
The '!=' operator adds trace filter with same effect as would
be added via set_ftrace_notrace file.
The right side of the '!=', '==' operators is list of functions
or regexp. to be added to filter separated by space.
The '||' operator is used for connecting multiple filter definitions
together. It is possible to have more than one '==' and '!='
operators within one filter string.
Link: http://lkml.kernel.org/r/1329317514-8131-8-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding FILTER_TRACE_FN event field type for function tracepoint
event, so it can be properly recognized within filtering code.
Currently all fields of ftrace subsystem events share the common
field type FILTER_OTHER. Since the function trace fields need
special care within the filtering code we need to recognize it
properly, hence adding the FILTER_TRACE_FN event type.
Adding filter parameter to the FTRACE_ENTRY macro, to specify the
filter field type for the event.
Link: http://lkml.kernel.org/r/1329317514-8131-7-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding perf registration support for the ftrace function event,
so it is now possible to register it via perf interface.
The perf_event struct statically contains ftrace_ops as a handle
for function tracer. The function tracer is registered/unregistered
in open/close actions.
To be efficient, we enable/disable ftrace_ops each time the traced
process is scheduled in/out (via TRACE_REG_PERF_(ADD|DELL) handlers).
This way tracing is enabled only when the process is running.
Intentionally using this way instead of the event's hw state
PERF_HES_STOPPED, which would not disable the ftrace_ops.
It is now possible to use function trace within perf commands
like:
perf record -e ftrace:function ls
perf stat -e ftrace:function ls
Allowed only for root.
Link: http://lkml.kernel.org/r/1329317514-8131-6-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding FTRACE_ENTRY_REG macro so particular ftrace entries
could specify registration function and thus become accesible
via perf.
This will be used in upcomming patch for function trace.
Link: http://lkml.kernel.org/r/1329317514-8131-5-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding TRACE_REG_PERF_ADD and TRACE_REG_PERF_DEL to handle
perf event schedule in/out actions.
The add action is invoked for when the perf event is scheduled in,
while the del action is invoked when the event is scheduled out.
Link: http://lkml.kernel.org/r/1329317514-8131-4-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding TRACE_REG_PERF_OPEN and TRACE_REG_PERF_CLOSE to differentiate
register/unregister from open/close actions.
The register/unregister actions are invoked for the first/last
tracepoint user when opening/closing the event.
The open/close actions are invoked for each tracepoint user when
opening/closing the event.
Link: http://lkml.kernel.org/r/1329317514-8131-3-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a way to temporarily enable/disable ftrace_ops. The change
follows the same way as 'global' ftrace_ops are done.
Introducing 2 global ftrace_ops - control_ops and ftrace_control_list
which take over all ftrace_ops registered with FTRACE_OPS_FL_CONTROL
flag. In addition new per cpu flag called 'disabled' is also added to
ftrace_ops to provide the control information for each cpu.
When ftrace_ops with FTRACE_OPS_FL_CONTROL is registered, it is
set as disabled for all cpus.
The ftrace_control_list contains all the registered 'control' ftrace_ops.
The control_ops provides function which iterates ftrace_control_list
and does the check for 'disabled' flag on current cpu.
Adding 3 inline functions:
ftrace_function_local_disable/ftrace_function_local_enable
- enable/disable the ftrace_ops on current cpu
ftrace_function_local_disabled
- get disabled ftrace_ops::disabled value for current cpu
Link: http://lkml.kernel.org/r/1329317514-8131-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If more than one __print_*() function is used in a tracepoint
(__print_flags(), __print_symbols(), etc), then the temp seq buffer will
not be zero on entry. Using the temp seq buffer's length to know if
data has been printed or not in the current function is incorrect and
may produce incorrect results.
Currently, no in-tree tracepoint causes this bug, but new ones may
be created.
Cc: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If __print_flags() is used after another __print_*() function, the
temp seq_file buffer will not be empty on entry, and the delimiter will
be printed even though there's just one field. We get something like:
|S
instead of just:
S
This is because the length of the temp seq buffer is used to determine
if the delimiter is printed or not. But this algorithm fails when
the seq buffer is not empty on entry, and the delimiter will be printed
because it thinks that a previous field was already printed.
Link: http://lkml.kernel.org/r/1329650167-480655-1-git-send-email-avagin@openvz.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.
The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107
Link: http://lkml.kernel.org/r/1322600880.1534.347.camel@localhost.localdomain
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Actually, sched_switch function tracer is merged into wakeup/wakeup_rt
Update 'mini-HOWTO' for ftrace(Kernel function tracer).
If we want to trace "sched:sched_switch" to trace sched_switch func,
We may utilize event option.(e.g: trace-cmd list -e | grep sched)
This patch is based on Linux-3.3.rc2-SMP-PREEMPT
Link: http://lkml.kernel.org/r/1328695537-15081-1-git-send-email-geunsik.lim@gmail.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the ftrace_set_filter and ftrace_set_notrace functions
do not return any return code. So there's no way for ftrace_ops
user to tell wether the filter was correctly applied.
The set_ftrace_filter interface returns error in case the filter
did not match:
# echo krava > set_ftrace_filter
bash: echo: write error: Invalid argument
Changing both ftrace_set_filter and ftrace_set_notrace functions
to return zero if the filter was applied correctly or -E* values
in case of error.
Link: http://lkml.kernel.org/r/1325495060-6402-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
perf tools: Fix compile error on x86_64 Ubuntu
perf report: Fix --stdio output alignment when --showcpuutilization used
perf annotate: Get rid of field_sep check
perf annotate: Fix usage string
perf kmem: Fix a memory leak
perf kmem: Add missing closedir() calls
perf top: Add error message for EMFILE
perf test: Change type of '-v' option to INCR
perf script: Add missing closedir() calls
tracing: Fix compile error when static ftrace is enabled
recordmcount: Fix handling of elf64 big-endian objects.
perf tools: Add const.h to MANIFEST to make perf-tar-src-pkg work again
perf tools: Add support for guest/host-only profiling
perf kvm: Do guest-only counting by default
perf top: Don't update total_period on process_sample
perf hists: Stop using 'self' for struct hist_entry
perf hists: Rename total_session to total_period
x86: Add counter when debug stack is used with interrupts enabled
x86: Allow NMIs to hit breakpoints in i386
x86: Keep current stack in NMI breakpoints
...
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (106 commits)
perf kvm: Fix copy & paste error in description
perf script: Kill script_spec__delete
perf top: Fix a memory leak
perf stat: Introduce get_ratio_color() helper
perf session: Remove impossible condition check
perf tools: Fix feature-bits rework fallout, remove unused variable
perf script: Add generic perl handler to process events
perf tools: Use for_each_set_bit() to iterate over feature flags
perf tools: Unify handling of features when writing feature section
perf report: Accept fifos as input file
perf tools: Moving code in some files
perf tools: Fix out-of-bound access to struct perf_session
perf tools: Continue processing header on unknown features
perf tools: Improve macros for struct feature_ops
perf: builtin-record: Document and check that mmap_pages must be a power of two.
perf: builtin-record: Provide advice if mmap'ing fails with EPERM.
perf tools: Fix truncated annotation
perf script: look up thread using tid instead of pid
perf tools: Look up thread names for system wide profiling
perf tools: Fix comm for processes with named threads
...
There are four places where new filter for a given filter string is
created, which involves several different steps. This patch factors
those steps into create_[system_]filter() functions which in turn make
use of create_filter_{start|finish}() for common parts.
The only functional change is that if replace_filter_string() is
requested and fails, creation fails without any side effect instead of
being ignored.
Note that system filter is now installed after the processing is
complete which makes freeing before and then restoring filter string
on error unncessary.
-v2: Rebased to resolve conflict with 49aa29513e and updated both
create_filter() functions to always set *filterp instead of
requiring the caller to clear it to %NULL on entry.
Link: http://lkml.kernel.org/r/1323988305-1469-2-git-send-email-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add stacktrace_filter= to the kernel command line that lets
the user pick specific functions to check the stack on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change set_ftrace_early_filter() to ftrace_set_early_filter()
and make it a global function. This will allow other subsystems
in the kernel to be able to enable function tracing at start
up and reuse the ftrace function parsing code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The stack_tracer is used to look at every function and check
if the current stack is bigger than the last recorded max stack size.
When a new max is found, then it saves that stack off.
Currently the stack tracer is limited by the global_ops of
the function tracer. As the stack tracer has nothing to do with
the ftrace function tracer, except that it uses it as its internal
engine, the stack tracer should have its own list.
A new file is added to the tracing debugfs directory called:
stack_trace_filter
that can be used to select which functions you want to check the stack
on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The set_ftrace_filter shows "hashed" functions, which are functions
that are added with operations to them (like traceon and traceoff).
As other subsystems may be able to show what functions they are
using for function tracing, the hash items should no longer
be shown just because the FILTER flag is set. As they have nothing
to do with other subsystems filters.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracer is set up to allow any other subsystem (like perf)
to use it. Ftrace already has a way to list what functions are enabled
by the global_ops. It would be very helpful to let other users of
the function tracer to be able to use the same code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are two types of hashes in the ftrace_ops; one type
is the filter_hash and the other is the notrace_hash. Either
one may be null, meaning it has no elements. But when elements
are added, the hash is allocated.
Throughout the code, a check needs to be made to see if a hash
exists or the hash has elements, but the check if the hash exists
is usually missing causing the possible "NULL pointer dereference bug".
Add a helper routine called "ftrace_hash_empty()" that returns
true if the hash doesn't exist or its count is zero. As they mean
the same thing.
Last-bug-reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When disabling the "notrace" records, that means we want to trace them.
If the notrace_hash is zero, it means that we want to trace all
records. But to disable a zero notrace_hash means nothing.
The check for the notrace_hash count was incorrect with:
if (hash && !hash->count)
return
With the correct comment above it that states that we do nothing
if the notrace_hash has zero count. But !hash also means that
the notrace hash has zero count. I think this was done to
protect against dereferencing NULL. But if !hash is true, then
we go through the following loop without doing a single thing.
Fix it to:
if (!hash || !hash->count)
return;
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that each set of pages in the function list are sorted by
ip, we can use bsearch to find a record within each set of pages.
This speeds up the ftrace_location() function by magnitudes.
For archs (like x86) that need to add a breakpoint at every function
that will be converted from a nop to a callback and vice versa,
the breakpoint callback needs to know if the breakpoint was for
ftrace or not. It requires finding the breakpoint ip within the
records. Doing a linear search is extremely inefficient. It is
a must to be able to do a fast binary search to find these locations.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Sort records by ip locations of the ftrace mcount calls on each of the
set of pages in the function list. This helps in localizing cache
usuage when updating the function locations, as well as gives us
the ability to quickly find an ip location in the list.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As new functions come in to be initalized from mcount to nop,
they are done by groups of pages. Whether it is the core kernel
or a module. There's no need to keep track of these on a per record
basis.
At startup, and as any module is loaded, the functions to be
traced are stored in a group of pages and added to the function
list at the end. We just need to keep a pointer to the first
page of the list that was added, and use that to know where to
start on the list for initializing functions.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allocate the mcount record pages as a group of pages as big
as can be allocated and waste no more than a single page.
Grouping the mcount pages as much as possible helps with cache
locality, as we do not need to redirect with descriptors as we
cross from page to page. It also allows us to do more with the
records later on (sort them with bigger benefits).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Records that are added to the function trace table are
permanently there, except for modules. By separating out the
modules to their own pages that can be freed in one shot
we can remove the "freed" flag and simplify some of the record
management.
Another benefit of doing this is that we can also move the
records around; sort them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The stop machine method to modify all functions in the kernel
(some 20,000 of them) is the safest way to do so across all archs.
But some archs may not need this big hammer approach to modify code
on SMP machines, and can simply just update the code it needs.
Adding a weak function arch_ftrace_update_code() that now does the
stop machine, will also let any arch override this method.
If the arch needs to check the system and then decide if it can
avoid stop machine, it can still call ftrace_run_stop_machine() to
use the old method.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Multiple users of the function tracer can register their functions
with the ftrace_ops structure. The accounting within ftrace will
update the counter on each function record that is being traced.
When the ftrace_ops filtering adds or removes functions, the
function records will be updated accordingly if the ftrace_ops is
still registered.
When a ftrace_ops is removed, the counter of the function records,
that the ftrace_ops traces, are decremented. When they reach zero
the functions that they represent are modified to stop calling the
mcount code.
When changes are made, the code is updated via stop_machine() with
a command passed to the function to tell it what to do. There is an
ENABLE and DISABLE command that tells the called function to enable
or disable the functions. But the ENABLE is really a misnomer as it
should just update the records, as records that have been enabled
and now have a count of zero should be disabled.
The DISABLE command is used to disable all functions regardless of
their counter values. This is the big off switch and is not the
complement of the ENABLE command.
To make matters worse, when a ftrace_ops is unregistered and there
is another ftrace_ops registered, neither the DISABLE nor the
ENABLE command are set when calling into the stop_machine() function
and the records will not be updated to match their counter. A command
is passed to that function that will update the mcount code to call
the registered callback directly if it is the only one left. This
means that the ftrace_ops that is still registered will have its callback
called by all functions that have been set for it as well as the ftrace_ops
that was just unregistered.
Here's a way to trigger this bug. Compile the kernel with
CONFIG_FUNCTION_PROFILER set and with CONFIG_FUNCTION_GRAPH not set:
CONFIG_FUNCTION_PROFILER=y
# CONFIG_FUNCTION_GRAPH is not set
This will force the function profiler to use the function tracer instead
of the function graph tracer.
# cd /sys/kernel/debug/tracing
# echo schedule > set_ftrace_filter
# echo function > current_tracer
# cat set_ftrace_filter
schedule
# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 692/68108025 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/0:2-909 [000] .... 531.235574: schedule <-worker_thread
<idle>-0 [001] .N.. 531.235575: schedule <-cpu_idle
kworker/0:2-909 [000] .... 531.235597: schedule <-worker_thread
sshd-2563 [001] .... 531.235647: schedule <-schedule_hrtimeout_range_clock
# echo 1 > function_profile_enabled
# echo 0 > function_porfile_enabled
# cat set_ftrace_filter
schedule
# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 159701/118821262 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [002] ...1 604.870655: local_touch_nmi <-cpu_idle
<idle>-0 [002] d..1 604.870655: enter_idle <-cpu_idle
<idle>-0 [002] d..1 604.870656: atomic_notifier_call_chain <-enter_idle
<idle>-0 [002] d..1 604.870656: __atomic_notifier_call_chain <-atomic_notifier_call_chain
The same problem could have happened with the trace_probe_ops,
but they are modified with the set_frace_filter file which does the
update at closure of the file.
The simple solution is to change ENABLE to UPDATE and call it every
time an ftrace_ops is unregistered.
Link: http://lkml.kernel.org/r/1323105776-26961-3-git-send-email-jolsa@redhat.com
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add an EXPORT_SYMBOL_GPL() so that rcutorture can dump the trace buffer
upon detection of an RCU error.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
If the set_ftrace_filter is cleared by writing just whitespace to
it, then the filter hash refcounts will be decremented but not
updated. This causes two bugs:
1) No functions will be enabled for tracing when they all should be
2) If the users clears the set_ftrace_filter twice, it will crash ftrace:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7()
Modules linked in:
Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32
Call Trace:
[<ffffffff81051828>] warn_slowpath_common+0x83/0x9b
[<ffffffff8105185a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810ba362>] __ftrace_hash_rec_update.part.27+0x157/0x1a7
[<ffffffff810ba6e8>] ? ftrace_regex_release+0xa7/0x10f
[<ffffffff8111bdfe>] ? kfree+0xe5/0x115
[<ffffffff810ba51e>] ftrace_hash_move+0x2e/0x151
[<ffffffff810ba6fb>] ftrace_regex_release+0xba/0x10f
[<ffffffff8112e49a>] fput+0xfd/0x1c2
[<ffffffff8112b54c>] filp_close+0x6d/0x78
[<ffffffff8113a92d>] sys_dup3+0x197/0x1c1
[<ffffffff8113a9a6>] sys_dup2+0x4f/0x54
[<ffffffff8150cac2>] system_call_fastpath+0x16/0x1b
---[ end trace 77a3a7ee73794a02 ]---
Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A forced undef of a config value was used for testing and was
accidently left in during the final commit. This causes x86 to
run slower than needed while running function tracing as well
as causes the function graph selftest to fail when DYNMAIC_FTRACE
is not set. This is because the code in MCOUNT expects the ftrace
code to be processed with the config value set that happened to
be forced not set.
The forced config option was left in by:
commit 6331c28c96
ftrace: Fix dynamic selftest failure on some archs
Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian
Cc: stable@vger.kernel.org
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Though not all events have field 'prev_pid', it was allowed to do this:
# echo 'prev_pid == 100' > events/sched/filter
but commit 75b8e98263 (tracing/filter: Swap
entire filter of events) broke it without any reason.
Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix a bug introduced by e9dbfae5, which prevents event_subsystem from
ever being released.
Ref_count was added to keep track of subsystem users, not for counting
events. Subsystem is created with ref_count = 1, so there is no need to
increment it for every event, we have nr_events for that. Fix this by
touching ref_count only when we actually have a new user -
subsystem_open().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Link: http://lkml.kernel.org/r/1320052062-7846-1-git-send-email-idryomov@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_event_call->filter is sched RCU protected but didn't use
rcu_assign_pointer(). Use it.
TODO: Add proper __rcu annotation to call->filter and all its users.
-v2: Use RCU_INIT_POINTER() for %NULL clearing as suggested by Eric.
Link: http://lkml.kernel.org/r/20111123164949.GA29639@google.com
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: stable@kernel.org # (2.6.39+)
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Knowing the number of event entries in the ring buffer compared
to the total number that were written is useful information. The
latency format gives this information and there's no reason that the
default format does not.
This information is now added to the default header, along with the
number of online CPUs:
# tracer: nop
#
# entries-in-buffer/entries-written: 159836/64690869 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] ...2 49.442971: local_touch_nmi <-cpu_idle
<idle>-0 [000] d..2 49.442973: enter_idle <-cpu_idle
<idle>-0 [000] d..2 49.442974: atomic_notifier_call_chain <-enter_idle
<idle>-0 [000] d..2 49.442976: __atomic_notifier_call_chain <-atomic_notifier
The above shows that the trace contains 159836 entries, but
64690869 were written. One could figure out that there were
64531033 entries that were dropped.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
People keep asking how to get the preempt count, irq, and need resched info
and we keep telling them to enable the latency format. Some developers think
that traces without this info is completely useless, and for a lot of tasks
it is useless.
The first option was to enable the latency trace as the default format, but
the header for the latency format is pretty useless for most tracers and
it also does the timestamp in straight microseconds from the time the trace
started. This is sometimes more difficult to read as the default trace is
seconds from the start of boot up.
Latency format:
# tracer: nop
#
# nop latency trace v1.1.5 on 3.2.0-rc1-test+
# --------------------------------------------------------------------
# latency: 0 us, #159771/64234230, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
migratio-6 0...2 41778231us+: rcu_note_context_switch <-__schedule
migratio-6 0...2 41778233us : trace_rcu_utilization <-rcu_note_context_switch
migratio-6 0...2 41778235us+: rcu_sched_qs <-rcu_note_context_switch
migratio-6 0d..2 41778236us+: rcu_preempt_qs <-rcu_note_context_switch
migratio-6 0...2 41778238us : trace_rcu_utilization <-rcu_note_context_switch
migratio-6 0...2 41778239us+: debug_lockdep_rcu_enabled <-__schedule
default format:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
migration/0-6 [000] 50.025810: rcu_note_context_switch <-__schedule
migration/0-6 [000] 50.025812: trace_rcu_utilization <-rcu_note_context_switch
migration/0-6 [000] 50.025813: rcu_sched_qs <-rcu_note_context_switch
migration/0-6 [000] 50.025815: rcu_preempt_qs <-rcu_note_context_switch
migration/0-6 [000] 50.025817: trace_rcu_utilization <-rcu_note_context_switch
migration/0-6 [000] 50.025818: debug_lockdep_rcu_enabled <-__schedule
migration/0-6 [000] 50.025820: debug_lockdep_rcu_enabled <-__schedule
The latency format header has latency information that is pretty meaningless
for most tracers. Although some of the header is useful, and we can add that
later to the default format as well.
What is really useful with the latency format is the irqs-off, need-resched
hard/softirq context and the preempt count.
This commit adds the option irq-info which is on by default that adds this
information:
# tracer: nop
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] d..2 49.309305: cpuidle_get_driver <-cpuidle_idle_call
<idle>-0 [000] d..2 49.309307: mwait_idle <-cpu_idle
<idle>-0 [000] d..2 49.309309: need_resched <-mwait_idle
<idle>-0 [000] d..2 49.309310: test_ti_thread_flag <-need_resched
<idle>-0 [000] d..2 49.309312: trace_power_start.constprop.13 <-mwait_idle
<idle>-0 [000] d..2 49.309313: trace_cpu_idle <-mwait_idle
<idle>-0 [000] d..2 49.309315: need_resched <-mwait_idle
If a user wants the old format, they can disable the 'irq-info' option:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [000] 49.309305: cpuidle_get_driver <-cpuidle_idle_call
<idle>-0 [000] 49.309307: mwait_idle <-cpu_idle
<idle>-0 [000] 49.309309: need_resched <-mwait_idle
<idle>-0 [000] 49.309310: test_ti_thread_flag <-need_resched
<idle>-0 [000] 49.309312: trace_power_start.constprop.13 <-mwait_idle
<idle>-0 [000] 49.309313: trace_cpu_idle <-mwait_idle
<idle>-0 [000] 49.309315: need_resched <-mwait_idle
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the set_ftrace_filter is cleared by writing just whitespace to
it, then the filter hash refcounts will be decremented but not
updated. This causes two bugs:
1) No functions will be enabled for tracing when they all should be
2) If the users clears the set_ftrace_filter twice, it will crash ftrace:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7()
Modules linked in:
Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32
Call Trace:
[<ffffffff81051828>] warn_slowpath_common+0x83/0x9b
[<ffffffff8105185a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810ba362>] __ftrace_hash_rec_update.part.27+0x157/0x1a7
[<ffffffff810ba6e8>] ? ftrace_regex_release+0xa7/0x10f
[<ffffffff8111bdfe>] ? kfree+0xe5/0x115
[<ffffffff810ba51e>] ftrace_hash_move+0x2e/0x151
[<ffffffff810ba6fb>] ftrace_regex_release+0xba/0x10f
[<ffffffff8112e49a>] fput+0xfd/0x1c2
[<ffffffff8112b54c>] filp_close+0x6d/0x78
[<ffffffff8113a92d>] sys_dup3+0x197/0x1c1
[<ffffffff8113a9a6>] sys_dup2+0x4f/0x54
[<ffffffff8150cac2>] system_call_fastpath+0x16/0x1b
---[ end trace 77a3a7ee73794a02 ]---
Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A forced undef of a config value was used for testing and was
accidently left in during the final commit. This causes x86 to
run slower than needed while running function tracing as well
as causes the function graph selftest to fail when DYNMAIC_FTRACE
is not set. This is because the code in MCOUNT expects the ftrace
code to be processed with the config value set that happened to
be forced not set.
The forced config option was left in by:
commit 6331c28c96
ftrace: Fix dynamic selftest failure on some archs
Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian
Cc: stable@vger.kernel.org
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The system filter can be used to set multiple event filters that
exist within the system. But currently it displays the last filter
written that does not necessarily correspond to the filters within
the system. The system filter itself is not used to filter any events.
The system filter is just a means to set filters of the events within
it.
Because this causes an ambiguous state when the system filter reads
a filter string but the events within the system have different strings
it is best to just show a boiler plate:
### global filter ###
# Use this to set filters for multiple events.
# Only events with the given fields will be affected.
# If no events are modified, an error message will be displayed here.
If an error occurs while writing to the system filter, the system
filter will replace the boiler plate with the error message as it
currently does.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Though not all events have field 'prev_pid', it was allowed to do this:
# echo 'prev_pid == 100' > events/sched/filter
but commit 75b8e98263 (tracing/filter: Swap
entire filter of events) broke it without any reason.
Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These files were getting <linux/module.h> via an implicit non-obvious
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Fix a bug introduced by e9dbfae5, which prevents event_subsystem from
ever being released.
Ref_count was added to keep track of subsystem users, not for counting
events. Subsystem is created with ref_count = 1, so there is no need to
increment it for every event, we have nr_events for that. Fix this by
touching ref_count only when we actually have a new user -
subsystem_open().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Link: http://lkml.kernel.org/r/1320052062-7846-1-git-send-email-idryomov@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These files are doing things like module_put and try_module_get
so they need to call out the module.h for explicit inclusion,
rather than getting it via <linux/device.h> which we ideally want
to remove the module.h inclusion from.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
perf symbols: Increase symbol KSYM_NAME_LEN size
perf hists browser: Refuse 'a' hotkey on non symbolic views
perf ui browser: Use libslang to read keys
perf tools: Fix tracing info recording
perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
perf hists: Don't consider filtered entries when calculating column widths
perf hists: Don't decay total_period for filtered entries
perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
perf hists browser: Do not exit on tab key with single event
perf annotate browser: Don't change selection line when returning from callq
perf tools: handle endianness of feature bitmap
perf tools: Add prelink suggestion to dso update message
perf script: Fix unknown feature comment
perf hists browser: Apply the dso and thread filters when merging new batches
perf hists: Move the dso and thread filters from hist_browser
perf ui browser: Honour the xterm colors
perf top tui: Give color hints just on the percentage, like on --stdio
perf ui browser: Make the colors configurable and change the defaults
perf tui: Remove unneeded call to newtCls on startup
perf hists: Don't format the percentage on hist_entry__snprintf
...
Fix up conflicts in arch/x86/kernel/kprobes.c manually.
Ingo's tree did the insane "add volatile to const array", which just
doesn't make sense ("volatile const"?). But we could remove the const
*and* make the array volatile to make doubly sure that gcc doesn't
optimize it away..
Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
reader_lock has been turned into a raw lock by the core locking merge,
and there was a new user of it introduced in this perf core merge. Make
sure that new use also uses the raw accessor functions.
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
lockdep: Comment all warnings
lib: atomic64: Change the type of local lock to raw_spinlock_t
locking, lib/atomic64: Annotate atomic64_lock::lock as raw
locking, x86, iommu: Annotate qi->q_lock as raw
locking, x86, iommu: Annotate irq_2_ir_lock as raw
locking, x86, iommu: Annotate iommu->register_lock as raw
locking, dma, ipu: Annotate bank_lock as raw
locking, ARM: Annotate low level hw locks as raw
locking, drivers/dca: Annotate dca_lock as raw
locking, powerpc: Annotate uic->lock as raw
locking, x86: mce: Annotate cmci_discover_lock as raw
locking, ACPI: Annotate c3_lock as raw
locking, oprofile: Annotate oprofilefs lock as raw
locking, video: Annotate vga console lock as raw
locking, latencytop: Annotate latency_lock as raw
locking, timer_stats: Annotate table_lock as raw
locking, rwsem: Annotate inner lock as raw
locking, semaphores: Annotate inner lock as raw
locking, sched: Annotate thread_group_cputimer as raw
...
Fix up conflicts in kernel/posix-cpu-timers.c manually: making
cputimer->cputime a raw lock conflicted with the ABBA fix in commit
bcd5cff721 ("cputimer: Cure lock inversion").
The trace_pipe_raw handler holds a cached page from the time the file
is opened to the time it is closed. The cached page is used to handle
the case of the user space buffer being smaller than what was read from
the ring buffer. The left over buffer is held in the cache so that the
next read will continue where the data left off.
After EOF is returned (no more data in the buffer), the index of
the cached page is set to zero. If a user app reads the page again
after EOF, the check in the buffer will see that the cached page
is less than page size and will return the cached page again. This
will cause reading the trace_pipe_raw again after EOF to return
duplicate data, making the output look like the time went backwards
but instead data is just repeated.
The fix is to not reset the index right after all data is read
from the cache, but to reset it after all data is read and more
data exists in the ring buffer.
Cc: stable <stable@kernel.org>
Reported-by: Jeremy Eder <jeder@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_enabled option is deprecated.
To start/stop tracing, write to /sys/kernel/debug/tracing/tracing_on
without tracing_enabled. This patch is based on Linux 3.1.0-rc1
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Link: http://lkml.kernel.org/r/1313127022-23830-1-git-send-email-leemgs1@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When doing intense tracing, the kmalloc inside trace_marker can
introduce side effects to what is being traced.
As trace_marker() is used by userspace to inject data into the
kernel ring buffer, it needs to do so with the least amount
of intrusion to the operations of the kernel or the user space
application.
As the ring buffer is designed to write directly into the buffer
without the need to make a temporary buffer, and userspace already
went through the hassle of knowing how big the write will be,
we can simply pin the userspace pages and write the data directly
into the buffer. This improves the impact of tracing via trace_marker
tremendously!
Thanks to Peter Zijlstra and Thomas Gleixner for pointing out the
use of get_user_pages_fast() and kmap_atomic().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the function tracer is very intrusive, lots of self checks are
performed on the tracer and if something is found to be strange
it will shut itself down keeping it from corrupting the rest of the
kernel. This shutdown may still allow functions to be traced, as the
tracing only stops new modifications from happening. Trying to stop
the function tracer itself can cause more harm as it requires code
modification.
Although a WARN_ON() is executed, a user may not notice it. To help
the user see that something isn't right with the tracing of the system
a big warning is added to the output of the tracer that lets the user
know that their data may be incomplete.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix kprobe-tracer not to delete a probe if the probe is in use.
In that case, delete operation will return -EBUSY.
This bug can cause a kernel panic if enabled probes are deleted
during perf record.
(Add some probes on functions)
sh-4.2# perf probe --del probe:\*
sh-4.2# exit
(kernel panic)
This is originally reported on the fedora bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=742383
I've also checked that this problem doesn't happen on
tracepoints when module removing because perf event
locks target module.
$ sudo ./perf record -e xfs:\* -aR sh
sh-4.2# rmmod xfs
ERROR: Module xfs is in use
sh-4.2# exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.203 MB perf.data (~8862 samples) ]
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/20111004104438.14591.6553.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* pm-runtime:
PM / Tracing: build rpm-traces.c only if CONFIG_PM_RUNTIME is set
PM / Runtime: Replace dev_dbg() with trace_rpm_*()
PM / Runtime: Introduce trace points for tracing rpm_* functions
PM / Runtime: Don't run callbacks under lock for power.irq_safe set
USB: Add wakeup info to debugging messages
PM / Runtime: pm_runtime_idle() can be called in atomic context
PM / Runtime: Add macro to test for runtime PM events
PM / Runtime: Add might_sleep() to runtime PM functions
Do not build kernel/trace/rpm-traces.c if CONFIG_PM_RUNTIME is not
set, which avoids a build failure.
[rjw: Added the changelog and modified the subject slightly.]
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
This patch introduces 3 trace points to prepare for tracing
rpm_idle/rpm_suspend/rpm_resume functions, so we can use these
trace points to replace the current dev_dbg().
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If irqs are disabled when preemption count reaches zero, the
preemptirqsoff tracer should not flag that as the end.
When interrupts are enabled and preemption count is not zero
the preemptirqsoff correctly continues its tracing.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When debugging tight race conditions, it can be helpful to have a
synchronized tracing method. Although in most cases the global clock
provides this functionality, if timings is not the issue, it is more
comforting to know that the order of events really happened in a precise
order.
Instead of using a clock, add a "counter" that is simply an incrementing
atomic 64bit counter that orders the events as they are perceived to
happen.
The trace_clock_counter() is added from the attempt by Peter Zijlstra
trying to convert the trace_clock_global() to it. I took Peter's counter
code and made trace_clock_counter() instead, and added it to the choice
of clocks. Just echo counter > /debug/tracing/trace_clock to activate
it.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Requested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing locks can be taken in atomic context and therefore
cannot be preempted on -rt - annotate it.
In mainline this change documents the low level nature of
the lock - otherwise there's no functional difference. Lockdep
and Sparse checking will work as usual.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The stats file under per_cpu folder provides the number of entries,
overruns and other statistics about the CPU ring buffer. However, the
numbers do not provide any indication of how full the ring buffer is in
bytes compared to the overall size in bytes. Also, it is helpful to know
the rate at which the cpu buffer is filling up.
This patch adds an entry "bytes: " in printed stats for per_cpu ring
buffer which provides the actual bytes consumed in the ring buffer. This
field includes the number of bytes used by recorded events and the
padding bytes added when moving the tail pointer to next page.
It also adds the following time stamps:
"oldest event ts:" - the oldest timestamp in the ring buffer
"now ts:" - the timestamp at the time of reading
The field "now ts" provides a consistent time snapshot to the userspace
when being read. This is read from the same trace clock used by tracing
event timestamps.
Together, these values provide the rate at which the buffer is filling
up, from the formula:
bytes / (now_ts - oldest_event_ts)
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-3-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current file "buffer_size_kb" reports the size of per-cpu buffer and
not the overall memory allocated which could be misleading. A new file
"buffer_total_size_kb" adds up all the enabled CPU buffer sizes and
reports it. This is only a readonly entry.
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-2-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The self testing for event filters does not really need preemption
disabled as there are no races at the time of testing, but the functions
it calls uses rcu_dereference_sched() which will complain if preemption
is not disabled.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding automated tests running as late_initcall. Tests are
compiled in with CONFIG_FTRACE_STARTUP_TEST option.
Adding test event "ftrace_test_filter" used to simulate
filter processing during event occurance.
String filters are compiled and tested against several
test events with different values.
Also testing that evaluation of explicit predicates is ommited
due to the lazy filter evaluation.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-11-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding walk_pred_tree function to be used for walking throught
the filter predicates.
For each predicate the callback function is called, allowing
users to add their own functionality or customize their way
through the filter predicates.
Changing check_pred_tree function to use walk_pred_tree.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-6-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We dont need to perform lookup through the ftrace_events list,
instead we can use the 'tp_event' field.
Each perf_event contains tracepoint event field 'tp_event', which
got initialized during the tracepoint event initialization.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-5-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The field_name was used just for finding event's fields. This way we
don't need to care about field_name allocation/free.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-4-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Making the code cleaner by having one function to fully prepare
the predicate (create_pred), and another to add the predicate to
the filter (filter_add_pred).
As a benefit, this way the dry_run flag stays only inside the
replace_preds function and is not passed deeper.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-3-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Don't dynamically allocate filter_pred struct, use static memory.
This way we can get rid of the code managing the dynamic filter_pred
struct object.
The create_pred function integrates create_logical_pred function.
This way the static predicate memory is returned only from
one place.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-2-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
Revert "cfq: Remove special treatment for metadata rqs."
block: fix flush machinery for stacking drivers with differring flush flags
block: improve rq_affinity placement
blktrace: add FLUSH/FUA support
Move some REQ flags to the common bio/request area
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
xen/blkback: Make description more obvious.
cfq-iosched: Add documentation about idling
block: Make rq_affinity = 1 work as expected
block: swim3: fix unterminated of_device_id table
block/genhd.c: remove useless cast in diskstats_show()
drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
bsg-lib: add module.h include
cfq-iosched: Reduce linked group count upon group destruction
blk-throttle: correctly determine sync bio
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
loop: add management interface for on-demand device allocation
loop: replace linked list of allocated devices with an idr index
...
Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
FUA follows WRITE, use the same 'F' flag for both cases and
distinguish them by their (relative) position. The end results
look like (other flags might be shown also):
- WRITE: W
- WRITE_FLUSH: FW
- WRITE_FUA: WF
- WRITE_FLUSH_FUA: FWF
Note that we reuse TC_BARRIER due to lack of bit space of act_mask
so that the older versions of blktrace tools will report flush
requests as barriers from now on.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
gcc incorrectly states that the variable "fmt" is uninitialized when
CC_OPITMIZE_FOR_SIZE is set.
Instead of just blindly setting fmt to NULL, the code is cleaned up
a little to be a bit easier for humans to follow, as well as gcc
to know the variables are initialized.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
What was scheduled to be 2.6.41 is now going to be 3.1 .
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1107250929370.8080@swampdragon.chaosbits.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the address of a module-local variable can only be
solved after the target module is loaded, the symbol
fetch-argument should be updated when loading target
module.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20110627072703.6528.75042.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To support probing module init functions, kprobe-tracer allows
user to define a probe on non-existed function when it is given
with a module name. This also enables user to set a probe on
a function on a specific module, even if a same name (but different)
function is locally defined in another module.
The module name must be in the front of function name and separated
by a ':'. e.g. btrfs:btrfs_init_sysfs
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20110627072656.6528.89970.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enabling function tracer to trace all functions, then load a module and
then disable function tracing will cause ftrace to fail.
This can also happen by enabling function tracing on the command line:
ftrace=function
and during boot up, modules are loaded, then you disable function tracing
with 'echo nop > current_tracer' you will trigger a bug in ftrace that
will shut itself down.
The reason is, the new ftrace code keeps ref counts of all ftrace_ops that
are registered for tracing. When one or more ftrace_ops are registered,
all the records that represent the functions that the ftrace_ops will
trace have a ref count incremented. If this ref count is not zero,
when the code modification runs, that function will be enabled for tracing.
If the ref count is zero, that function will be disabled from tracing.
To make sure the accounting was working, FTRACE_WARN_ON()s were added
to updating of the ref counts.
If the ref count hits its max (> 2^30 ftrace_ops added), or if
the ref count goes below zero, a FTRACE_WARN_ON() is triggered which
disables all modification of code.
Since it is common for ftrace_ops to trace all functions in the kernel,
instead of creating > 20,000 hash items for the ftrace_ops, the hash
count is just set to zero, and it represents that the ftrace_ops is
to trace all functions. This is where the issues arrise.
If you enable function tracing to trace all functions, and then add
a module, the modules function records do not get the ref count updated.
When the function tracer is disabled, all function records ref counts
are subtracted. Since the modules never had their ref counts incremented,
they go below zero and the FTRACE_WARN_ON() is triggered.
The solution to this is rather simple. When modules are loaded, and
their functions are added to the the ftrace pool, look to see if any
ftrace_ops are registered that trace all functions. And for those,
update the ref count for the module function records.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rename probe_* to trace_probe_* for avoiding namespace
confliction. This also fixes improper names of find_probe_event()
and cleanup_all_probes() to find_trace_probe() and
release_all_trace_probes() respectively.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20110627072636.6528.60374.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the stack trace per event in ftace is only 8 frames.
This can be quite limiting and sometimes useless. Especially when
the "ignore frames" is wrong and we also use up stack frames for
the event processing itself.
Change this to be dynamic by adding a percpu buffer that we can
write a large stack frame into and then copy into the ring buffer.
For interrupts and NMIs that come in while another event is being
process, will only get to use the 8 frame stack. That should be enough
as the task that it interrupted will have the full stack frame anyway.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Archs that do not implement CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST, will
fail the dynamic ftrace selftest.
The function tracer has a quick 'off' variable that will prevent
the call back functions from being called. This variable is called
function_trace_stop. In x86, this is implemented directly in the mcount
assembly, but for other archs, an intermediate function is used called
ftrace_test_stop_func().
In dynamic ftrace, the function pointer variable ftrace_trace_function is
used to update the caller code in the mcount caller. But for archs that
do not have CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST set, it only calls
ftrace_test_stop_func() instead, which in turn calls __ftrace_trace_function.
When more than one ftrace_ops is registered, the function it calls is
ftrace_ops_list_func(), which will iterate over all registered ftrace_ops
and call the callbacks that have their hash matching.
The issue happens when two ftrace_ops are registered for different functions
and one is then unregistered. The __ftrace_trace_function is then pointed
to the remaining ftrace_ops callback function directly. This mean it will
be called for all functions that were registered to trace by both ftrace_ops
that were registered.
This is not an issue for archs with CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST,
because the update of ftrace_trace_function doesn't happen until after all
functions have been updated, and then the mcount caller is updated. But
for those archs that do use the ftrace_test_stop_func(), the update is
immediate.
The dynamic selftest fails because it hits this situation, and the
ftrace_ops that it registers fails to only trace what it was suppose to
and instead traces all other functions.
The solution is to delay the setting of __ftrace_trace_function until
after all the functions have been updated according to the registered
ftrace_ops. Also, function_trace_stop is set during the update to prevent
function tracing from calling code that is caused by the function tracer
itself.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, if set_ftrace_filter() is called when the ftrace_ops is
active, the function filters will not be updated. They will only be updated
when tracing is disabled and re-enabled.
Update the functions immediately during set_ftrace_filter().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Whenever the hash of the ftrace_ops is updated, the record counts
must be balance. This requires disabling the records that are set
in the original hash, and then enabling the records that are set
in the updated hash.
Moving the update into ftrace_hash_move() removes the bug where the
hash was updated but the records were not, which results in ftrace
triggering a warning and disabling itself because the ftrace_ops filter
is updated while the ftrace_ops was registered, and then the failure
happens when the ftrace_ops is unregistered.
The current code will not trigger this bug, but new code will.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When I mounted an NFS directory, it caused several modules to be loaded. At the
time I was running the preemptirqsoff tracer, and it showed the following
output:
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 2.6.33.9-rt30-mrg-test
# --------------------------------------------------------------------
# latency: 1177 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: modprobe-19370 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: ftrace_module_notify
# => ended at: ftrace_module_notify
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| /_--=> lock-depth
# |||||/ delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
modprobe-19370 3d.... 0us!: ftrace_process_locs <-ftrace_module_notify
modprobe-19370 3d.... 1176us : ftrace_process_locs <-ftrace_module_notify
modprobe-19370 3d.... 1178us : trace_hardirqs_on <-ftrace_module_notify
modprobe-19370 3d.... 1178us : <stack trace>
=> ftrace_process_locs
=> ftrace_module_notify
=> notifier_call_chain
=> __blocking_notifier_call_chain
=> blocking_notifier_call_chain
=> sys_init_module
=> system_call_fastpath
That's over 1ms that interrupts are disabled on a Real-Time kernel!
Looking at the cause (being the ftrace author helped), I found that the
interrupts are disabled before the code modification of mcounts into nops. The
interrupts only need to be disabled on start up around this code, not when
modules are being loaded.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a function is set to be traced by the set_graph_function, but the
option funcgraph-irqs is zero, and the traced function happens to be
called from a interrupt, it will not be traced.
The point of funcgraph-irqs is to not trace interrupts when we are
preempted by an irq, not to not trace functions we want to trace that
happen to be *in* a irq.
Luckily the current->trace_recursion element is perfect to add a flag
to help us be able to trace functions within an interrupt even when
we are not tracing interrupts that preempt the trace.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "enable" file for the event system can be removed when a module
is unloaded and the event system only has events from that module.
As the event system nr_events count goes to zero, it may be freed
if its ref_count is also set to zero.
Like the "filter" file, the "enable" file may be opened by a task and
referenced later, after a module has been unloaded and the events for
that event system have been removed.
Although the "filter" file referenced the event system structure,
the "enable" file only references a pointer to the event system
name. Since the name is freed when the event system is removed,
it is possible that an access to the "enable" file may reference
a freed pointer.
Update the "enable" file to use the subsystem_open() routine that
the "filter" file uses, to keep a reference to the event system
structure while the "enable" file is opened.
Cc: <stable@kernel.org>
Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event system is freed when its nr_events is set to zero. This happens
when a module created an event system and then later the module is
removed. Modules may share systems, so the system is allocated when
it is created and freed when the modules are unloaded and all the
events under the system are removed (nr_events set to zero).
The problem arises when a task opened the "filter" file for the
system. If the module is unloaded and it removed the last event for
that system, the system structure is freed. If the task that opened
the filter file accesses the "filter" file after the system has
been freed, the system will access an invalid pointer.
By adding a ref_count, and using it to keep track of what
is using the event system, we can free it after all users
are finished with the event system.
Cc: <stable@kernel.org>
Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix to support kernel stack trace correctly on kprobe-tracer.
Since the execution path of kprobe-based dynamic events is different
from other tracepoint-based events, normal ftrace_trace_stack() doesn't
work correctly. To fix that, this introduces ftrace_trace_stack_regs()
which traces stack via pt_regs instead of current stack register.
e.g.
# echo p schedule+4 > /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/options/stacktrace
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
# head -n 20 /sys/kernel/debug/tracing/trace
bash-2968 [000] 10297.050245: p_schedule_4: (schedule+0x4/0x4ca)
bash-2968 [000] 10297.050247: <stack trace>
=> schedule_timeout
=> n_tty_read
=> tty_read
=> vfs_read
=> sys_read
=> system_call_fastpath
kworker/0:1-2940 [000] 10297.050265: p_schedule_4: (schedule+0x4/0x4ca)
kworker/0:1-2940 [000] 10297.050266: <stack trace>
=> worker_thread
=> kthread
=> kernel_thread_helper
sshd-1132 [000] 10297.050365: p_schedule_4: (schedule+0x4/0x4ca)
sshd-1132 [000] 10297.050365: <stack trace>
=> sysret_careful
Note: Even with this fix, the first entry will be skipped
if the probe is put on the function entry area before
the frame pointer is set up (usually, that is 4 bytes
(push %bp; mov %sp %bp) on x86), because stack unwinder
depends on the frame pointer.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Namhyung Kim <namhyung@gmail.com>
Link: http://lkml.kernel.org/r/20110608070934.17777.17116.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing ring buffer is allocated from kernel memory. While
allocating a large chunk of memory, OOM might happen which destabilizes
the system. Thus random processes might get killed during the
allocation.
This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
to make it fail more gracefully if the system will not be able to
complete the allocation request.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1307491302-9236-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch replaces the code for getting an unsigned long from a
userspace buffer by a simple call to kstroul_from_user.
This makes it easier to read and less error prone.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Link: http://lkml.kernel.org/r/1307476707-14762-1-git-send-email-peterhuewe@gmx.de
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function_graph tracer does not follow global context-info option.
Adding TRACE_ITER_CONTEXT_INFO trace_flags check to enable it.
With following commands:
# echo function_graph > ./current_tracer
# echo 0 > options/context-info
# cat trace
This is what it looked like before:
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
1) 0.079 us | } /* __vma_link_rb */
1) 0.056 us | copy_page_range();
1) | security_vm_enough_memory() {
...
This is what it looks like now:
# tracer: function_graph
#
} /* update_ts_time_stats */
timekeeping_max_deferment();
...
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-6-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The header display of function tracer does not follow
the context-info option, so field names are displayed even
if this option is off.
Added check for TRACE_ITER_CONTEXT_INFO trace_flags.
With following commands:
# echo function > ./current_tracer
# echo 0 > options/context-info
# cat trace
This is what it looked like before:
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
add_preempt_count <-schedule
rcu_note_context_switch <-schedule
...
This is what it looks like now:
# tracer: function
#
_raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
...
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-4-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Functions print_graph_overhead() and print_graph_duration() displays
data for one field - DURATION.
I merged them into single function print_graph_duration(),
and added a way to display the empty parts of the field.
This way the print_graph_irq() function can use this column to display
the IRQ signs if needed and the DURATION field details stays inside
the print_graph_duration() function.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-3-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The display of absolute time and duration fields is based on the
latency field. This was added during the irqsoff/wakeup tracers
graph support changes.
It's causing confusion in what fields will be displayed for the
function_graph tracer itself. So I'm removing this depency, and
adding absolute time and duration fields to the preemptirqsoff
preemptoff irqsoff wakeup tracers.
With following commands:
# echo function_graph > ./current_tracer
# cat trace
This is what it looked like before:
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
0) 0.068 us | } /* page_add_file_rmap */
0) | _raw_spin_unlock() {
...
This is what it looks like now:
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) 0.068 us | } /* add_preempt_count */
0) 0.993 us | } /* vfsmount_lock_local_lock */
...
For preemptirqsoff preemptoff irqsoff wakeup tracers,
this is what it looked like before:
SNIP
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> lock-depth
# |||| /
# CPU TASK/PID ||||| DURATION FUNCTION CALLS
# | | | ||||| | | | | | |
1) <idle>-0 | d..1 0.000 us | acpi_idle_enter_simple();
...
This is what it looks like now:
SNIP
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| /
# TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
19.847735 | 1) <idle>-0 | d..1 0.000 us | acpi_idle_enter_simple();
...
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-2-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a trace option to disable tracing on free. When this option is
set, a write into the free_buffer file will not only shrink the
ring buffer down to zero, but it will also disable tracing.
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The proc file entry buffer_size_kb is used to set the size of tracing
buffer. The memory to expand the buffer size is kernel memory. Consider
a use case where tracing is handled by a user space utility, which acts
as a gate keeper for tracing requests. In an OOM condition, tracing is
considered a low priority task and if the utility gets killed the ring
buffer memory cannot be released back to the kernel.
This patch adds a proc file called "free_buffer" whose purpose is to
stop tracing and free up the ring buffer when it is closed.
The user space process can then set the desired size in buffer_size_kb
file and open the fd to the "free_buffer" file. Under OOM condition, if
the process gets killed, the kernel closes the file descriptor. The
release handler stops the tracing and releases the kernel memory
automatically.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/1308012717-11148-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing ring buffer is a group of per-cpu ring buffers where
allocation and logging is done on a per-cpu basis. The events that are
generated on a particular CPU are logged in the corresponding buffer.
This is to provide wait-free writes between CPUs and good NUMA node
locality while accessing the ring buffer.
However, the allocation routines consider NUMA locality only for buffer
page metadata and not for the actual buffer page. This causes the pages
to be allocated on the NUMA node local to the CPU where the allocation
routine is running at the time.
This patch fixes the problem by using a NUMA node specific allocation
routine so that the pages are allocated from a NUMA node local to the
logging CPU.
I tested with the getuid_microbench from autotest. It is a simple binary
that calls getuid() in a loop and measures the average time for the
syscall to complete. The following command was used to test:
$ getuid_microbench 1000000
Compared the numbers found on kernel with and without this patch and
found that logging latency decreases by 30-50 ns/call.
tracing with non-NUMA allocation - 569 ns/call
tracing with NUMA allocation - 512 ns/call
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1304470602-20366-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In using syscall tracing by concurrent processes, the wakeup() that is
called in the event commit function causes contention on the spin lock
of the waitqueue. I enabled sys_enter_getuid and sys_exit_getuid
tracepoints, and by running getuid_microbench from autotest in parallel
I found that the contention causes exponential latency increase in the
tracing path.
The autotest binary getuid_microbench calls getuid() in a tight loop for
the given number of iterations and measures the average time required to
complete a single invocation of syscall.
The patch schedules a delayed work after 2 ms once an event commit calls
to wake up the trace wait_queue. This removes the delay caused by
contention on spin lock in wakeup() and amortizes the wakeup() calls
scheduled over the 2 ms period.
In the following example, the script enables the sys_enter_getuid and
sys_exit_getuid tracepoints and runs the getuid_microbench in parallel
with the given number of processes. The output clearly shows the latency
increase caused by contentions.
$ ~/getuid.sh 1
1000000 calls in 0.720974253 s (720.974253 ns/call)
$ ~/getuid.sh 2
1000000 calls in 1.166457554 s (1166.457554 ns/call)
1000000 calls in 1.168933765 s (1168.933765 ns/call)
$ ~/getuid.sh 3
1000000 calls in 1.783827516 s (1783.827516 ns/call)
1000000 calls in 1.795553270 s (1795.553270 ns/call)
1000000 calls in 1.796493376 s (1796.493376 ns/call)
$ ~/getuid.sh 4
1000000 calls in 4.483041796 s (4483.041796 ns/call)
1000000 calls in 4.484165388 s (4484.165388 ns/call)
1000000 calls in 4.484850762 s (4484.850762 ns/call)
1000000 calls in 4.485643576 s (4485.643576 ns/call)
$ ~/getuid.sh 5
1000000 calls in 6.497521653 s (6497.521653 ns/call)
1000000 calls in 6.502000236 s (6502.000236 ns/call)
1000000 calls in 6.501709115 s (6501.709115 ns/call)
1000000 calls in 6.502124100 s (6502.124100 ns/call)
1000000 calls in 6.502936358 s (6502.936358 ns/call)
After the patch, the latencies scale better.
1000000 calls in 0.728720455 s (728.720455 ns/call)
1000000 calls in 0.842782857 s (842.782857 ns/call)
1000000 calls in 0.883803135 s (883.803135 ns/call)
1000000 calls in 0.902077764 s (902.077764 ns/call)
1000000 calls in 0.902838202 s (902.838202 ns/call)
1000000 calls in 0.908896885 s (908.896885 ns/call)
1000000 calls in 0.932523515 s (932.523515 ns/call)
1000000 calls in 0.958009672 s (958.009672 ns/call)
1000000 calls in 0.986188020 s (986.188020 ns/call)
1000000 calls in 0.989771102 s (989.771102 ns/call)
1000000 calls in 0.933518391 s (933.518391 ns/call)
1000000 calls in 0.958897947 s (958.897947 ns/call)
1000000 calls in 1.031038897 s (1031.038897 ns/call)
1000000 calls in 1.089516025 s (1089.516025 ns/call)
1000000 calls in 1.141998347 s (1141.998347 ns/call)
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1305059241-7629-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The fix to fix the printk_formats of modules broke the
printk_formats of trace_printks in the kernel.
The update of what to show via the seq_file was only updated
if the passed in fmt was NULL, which happens only on the first
iteration. The result was showing the first format every time
instead of iterating through the available formats.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Revert the commit that removed the disabling of interrupts around
the initial modifying of mcount callers to nops, and update the comment.
The original comment was outdated and stated that the interrupts were
being disabled to prevent kstop machine, which was required with the
old ftrace daemon, but was no longer the case.
What the comment failed to mention was that interrupts needed to be
disabled to keep interrupts from preempting the modifying of the code
and then executing the code that was partially modified.
Revert the commit and update the comment.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With gcc 4.6, the self test kprobe function:
kprobe_trace_selftest_target()
is optimized such that kallsyms does not list it. The kprobes
test uses this function to insert a probe and test it. But
it will fail the test if the function is not listed in kallsyms.
Adding a __used annotation keeps the symbol in the kallsyms table.
Suggested-by: David Daney <ddaney@caviumnetworks.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kernel/trace/ftrace.c: In function 'ftrace_regex_write.clone.15':
kernel/trace/ftrace.c:2743:6: warning: 'ret' may be used uninitialized in this
function
Signed-off-by: GuoWen Li <guowen.li.linux@gmail.com>
Link: http://lkml.kernel.org/r/201106011918.47939.guowen.li.linux@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Witold reported a reboot caused by the selftests of the dynamic function
tracer. He sent me a config and I used ktest to do a config_bisect on it
(as my config did not cause the crash). It pointed out that the problem
config was CONFIG_PROVE_RCU.
What happened was that if multiple callbacks are attached to the
function tracer, we iterate a list of callbacks. Because the list is
managed by synchronize_sched() and preempt_disable, the access to the
pointers uses rcu_dereference_raw().
When PROVE_RCU is enabled, the rcu_dereference_raw() calls some
debugging functions, which happen to be traced. The tracing of the debug
function would then call rcu_dereference_raw() which would then call the
debug function and then... well you get the idea.
I first wrote two different patches to solve this bug.
1) add a __rcu_dereference_raw() that would not do any checks.
2) add notrace to the offending debug functions.
Both of these patches worked.
Talking with Paul McKenney on IRC, he suggested to add recursion
detection instead. This seemed to be a better solution, so I decided to
implement it. As the task_struct already has a trace_recursion to detect
recursion in the ring buffer, and that has a very small number it
allows, I decided to use that same variable to add flags that can detect
the recursion inside the infrastructure of the function tracer.
I plan to change it so that the task struct bit can be checked in
mcount, but as that requires changes to all archs, I will hold that off
to the next merge window.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Filesystem, like Btrfs, has some "ULL" macros, and when these macros are passed
to tracepoints'__print_symbolic(), there will be 64->32 truncate WARNINGS during
compiling on 32bit box.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/4DACE6E0.7000507@cn.fujitsu.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When dynamic ftrace is not configured, the ops->flags still needs
to have its FTRACE_OPS_FL_ENABLED bit set in ftrace_startup().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The self tests for event tracer does not check if the function
tracing was successfully activated. It needs to before it continues
the tests, otherwise the wrong errors may be reported.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The register_ftrace_function() returns an error code on failure
except if the call to ftrace_startup() fails. Add a error return to
ftrace_startup() if it fails to start, allowing register_ftrace_funtion()
to return a proper error value.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits)
sched: Fix and optimise calculation of the weight-inverse
sched: Avoid going ahead if ->cpus_allowed is not changed
sched, rt: Update rq clock when unthrottling of an otherwise idle CPU
sched: Remove unused parameters from sched_fork() and wake_up_new_task()
sched: Shorten the construction of the span cpu mask of sched domain
sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG
sched: Remove unused 'this_best_prio arg' from balance_tasks()
sched: Remove noop in alloc_rt_sched_group()
sched: Get rid of lock_depth
sched: Remove obsolete comment from scheduler_tick()
sched: Fix sched_domain iterations vs. RCU
sched: Next buddy hint on sleep and preempt path
sched: Make set_*_buddy() work on non-task entities
sched: Remove need_migrate_task()
sched: Move the second half of ttwu() to the remote cpu
sched: Restructure ttwu() some more
sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
sched: Remove rq argument from ttwu_stat()
sched: Remove rq->lock from the first half of ttwu()
sched: Drop rq->lock from sched_exec()
...
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix rt_rq runtime leakage bug
Since users of the function tracer can now pick and choose which
functions they want to trace agnostically from other users of the
function tracer, we need to pass the ops struct to the ftrace_set_filter()
functions.
The functions ftrace_set_global_filter() and ftrace_set_global_notrace()
is added to keep the old filter functions which are used to modify
the generic function tracers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that functions may be selected individually, it only makes sense
that we should allow dynamically allocated trace structures to
be traced. This will allow perf to allocate a ftrace_ops structure
at runtime and use it to pick and choose which functions that
structure will trace.
Note, a dynamically allocated ftrace_ops will always be called
indirectly instead of being called directly from the mcount in
entry.S. This is because there's no safe way to prevent mcount
from being preempted before calling the function, unless we
modify every entry.S to do so (not likely). Thus, dynamically allocated
functions will now be called by the ftrace_ops_list_func() that
loops through the ops that are allocated if there are more than
one op allocated at a time. This loop is protected with a
preempt_disable.
To determine if an ftrace_ops structure is allocated or not, a new
util function was added to the kernel/extable.c called
core_kernel_data(), which returns 1 if the address is between
_sdata and _edata.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_ops that are registered to trace functions can now be
agnostic to each other in respect to what functions they trace.
Each ops has their own hash of the functions they want to trace
and a hash to what they do not want to trace. A empty hash for
the functions they want to trace denotes all functions should
be traced that are not in the notrace hash.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a hash is modified and might be in use, we need to perform
a schedule RCU operation on it, as the hashes will soon be used
directly in the function tracer callback.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is a step towards each ops structure defining its own set
of functions to trace. As the current code with pid's and such
are specific to the global_ops, it is restructured to be used
with the global ops.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to allow different ops to enable different functions,
the ftrace_startup() and ftrace_shutdown() functions need the
ops parameter passed to them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the enabled_functions file that is used to show all the
functions that have been enabled for tracing as well as their
ref counts. This helps seeing if any function has been registered
and what functions are being traced.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every function has its own record that stores the instruction
pointer and flags for the function to be traced. There are only
two flags: enabled and free. The enabled flag states that tracing
for the function has been enabled (actively traced), and the free
flag states that the record no longer points to a function and can
be used by new functions (loaded modules).
These flags are now moved to the MSB of the flags (actually just
the top 32bits). The rest of the bits (30 bits) are now used as
a ref counter. Everytime a tracer register functions to trace,
those functions will have its counter incremented.
When tracing is enabled, to determine if a function should be traced,
the counter is examined, and if it is non-zero it is set to trace.
When a ftrace_ops is registered to trace functions, its hashes
are examined. If the ftrace_ops filter_hash count is zero, then
all functions are set to be traced, otherwise only the functions
in the hash are to be traced. The exception to this is if a function
is also in the ftrace_ops notrace_hash. Then that function's counter
is not incremented for this ftrace_ops.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When filtering, allocate a hash to insert the function records.
After the filtering is complete, assign it to the ftrace_ops structure.
This allows the ftrace_ops structure to have a much smaller array of
hash buckets instead of wasting a lot of memory.
A read only empty_hash is created to be the minimum size that any ftrace_ops
can point to.
When a new hash is created, it has the following steps:
o Allocate a default hash.
o Walk the function records assigning the filtered records to the hash
o Allocate a new hash with the appropriate size buckets
o Move the entries from the default hash to the new hash.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Combine the filter and notrace hashes to be accessed by a single entity,
the global_ops. The global_ops is a ftrace_ops structure that is passed
to different functions that can read or modify the filtering of the
function tracer.
The ftrace_ops structure was modified to hold a filter and notrace
hashes so that later patches may allow each ftrace_ops to have its own
set of rules to what functions may be filtered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When multiple users are allowed to have their own set of functions
to trace, having the FTRACE_FL_FILTER flag will not be enough to
handle the accounting of those users. Each user will need their own
set of functions.
Replace the FTRACE_FL_FILTER with a filter_hash instead. This is
temporary until the rest of the function filtering accounting
gets in.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To prepare for the accounting system that will allow multiple users of
the function tracer, having the FTRACE_FL_NOTRACE as a flag in the
dyn_trace record does not make sense.
All ftrace_ops will soon have a hash of functions they should trace
and not trace. By making a global hash of functions not to trace makes
this easier for the transition.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This partially reverts commit e6e1e25935.
That commit changed the structure layout of the trace structure, which
in turn broke PowerTOP (1.9x generation) quite badly.
I appreciate not wanting to expose the variable in question, and
PowerTOP was not using it, so I've replaced the variable with just a
padding field - that way if in the future a new field is needed it can
just use this padding field.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code used for matching functions is almost identical between normal
selecting of functions and using the :mod: feature of set_ftrace_notrace.
Consolidate the two users into one function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are three locations that perform almost identical functions in order
to update the ftrace_trace_function (the ftrace function variable that gets
called by mcount).
Consolidate these into a single function called update_ftrace_function().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The updating of a function record is moved to a single function. This will allow
us to add specific changes in one location for both modules and kernel
functions.
Later patches will determine if the function record itself needs to be updated
(which enables the mcount caller), or just the ftrace_ops needs the update.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since we disable all function tracer processing if we detect
that a modification of a instruction had failed, we do not need
to track that the record has failed. No more ftrace processing
is allowed, and the FTRACE_FL_CONVERTED flag is pointless.
The FTRACE_FL_CONVERTED flag was used to denote records that were
successfully converted from mcount calls into nops. But if a single
record fails, all of ftrace is disabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since we disable all function tracer processing if we detect
that a modification of a instruction had failed, we do not need
to track that the record has failed. No more ftrace processing
is allowed, and the FTRACE_FL_FAILED flag is pointless.
Removing this flag simplifies some of the code, but some ftrace_disabled
checks needed to be added or move around a little.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The failures file in the debugfs tracing directory would list the
functions that failed to convert when the old dead ftrace daemon
tried to update code but failed. Since this code is now dead along
with the daemon the failures file is useless. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The disabling of interrupts around ftrace_update_code() was used
to protect against the evil ftrace daemon from years past. But that
daemon has long been killed. It is safe to keep interrupts enabled
while updating the initial mcount into nops.
The ftrace_mutex is also held which keeps other users at bay.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Let FTRACE_WARN_ON() be used as a stand alone statement or
inside a conditional: if (FTRACE_WARN_ON(x))
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If function tracing is enabled, a read of the filter files will
cause the call to stop_machine to update the function trace sites.
It should only call stop_machine on write.
Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
include/linux/perf_event.h
Merge reason: pick up the latest jump-label enhancements, they are cooked ready.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Neil Brown pointed out that lock_depth somehow escaped the BKL
removal work. Let's get rid of it now.
Note that the perf scripting utilities still have a bunch of
code for dealing with common_lock_depth in tracepoints; I have
left that in place in case anybody wants to use that code with
older kernels.
Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time, in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
or an implicit unplug (timer unplug, we scheduled with pending IO
queued).
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It was removed with the on-stack plugging, readd it and track the
depth of requests added when flushing the plug.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
running following commands:
# enable the binary option
echo 1 > ./options/bin
# disable context info option
echo 0 > ./options/context-info
# tracing only events
echo 1 > ./events/enable
cat trace_pipe
plus forcing system to generate many tracing events,
is causing lockup (in NON preemptive kernels) inside
tracing_read_pipe function.
The issue is also easily reproduced by running ltp stress test.
(ftrace_stress_test.sh)
The reasons are:
- bin/hex/raw output functions for events are set to
trace_nop_print function, which prints nothing and
returns TRACE_TYPE_HANDLED value
- LOST EVENT trace do not handle trace_seq overflow
These reasons force the while loop in tracing_read_pipe
function never to break.
The attached patch fixies handling of lost event trace, and
changes trace_nop_print to print minimal info, which is needed
for the correct tracing_read_pipe processing.
v2 changes:
- omit the cond_resched changes by trace_nop_print changes
- WARN changed to WARN_ONCE and added info to be able
to find out the culprit
v3 changes:
- make more accurate patch comment
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110325110518.GC1922@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The file debugfs/tracing/printk_formats maps the addresses
to the formats that are used by trace_bprintk() so that userspace
tools can read the buffer and be able to decode trace_bprintk events
to get the format saved when reading the ring buffer directly.
This is because trace_bprintk() does not store the format into the
buffer, but just the address of the format, which is hidden in
the kernel memory.
But currently it only exports trace_bprintk()s from the kernel core
and not for modules. The modules need their formats exported
as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() formats for modules do not show up in the
debugfs/tracing/printk_formats file. Only the formats that are
for trace_printk()s that are in the kernel core.
To facilitate the change to add trace_printk() formats from modules
into that file as well, we need to convert the structure that
holds the formats from char fmt[], into const char *fmt,
and allocate them separately.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue
perf, x86: P4 PMU - Read proper MSR register to catch unflagged overflows
perf symbols: Look at .dynsym again if .symtab not found
perf build-id: Add quirk to deal with perf.data file format breakage
perf session: Pass evsel in event_ops->sample()
perf: Better fit max unprivileged mlock pages for tools needs
perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()
perf top: Fix uninitialized 'counter' variable
tracing: Fix set_ftrace_filter probe function display
perf, x86: Fix Intel fixed counters base initialization
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
If one or more function probes (like traceon) are enabled,
and there's no other function filter, the first probe
func is skipped (which one depends on the position in the hash).
$ echo sys_open:traceon sys_close:traceon > ./set_ftrace_filter
$ cat set_ftrace_filter
#### all functions enabled ####
sys_close:traceon:unlimited
$
The reason was, that in the case of no other function filter,
the func_pos was not properly updated before calling t_hash_start.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1297874134-7008-1-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
trace, filters: Initialize the match variable in process_ops() properly
trace, documentation: Fix branch profiling location in debugfs
oprofile, s390: Cleanups
oprofile, s390: Remove hwsampler_files.c and merge it into init.c
perf: Fix tear-down of inherited group events
perf: Reorder & optimize perf_event_context to remove alignment padding on 64 bit builds
perf: Handle stopped state with tracepoints
perf: Fix the software events state check
perf, powerpc: Handle events that raise an exception without overflowing
perf, x86: Use INTEL_*_CONSTRAINT() for all PEBS event constraints
perf, x86: Clean up SandyBridge PEBS events
perf lock: Fix sorting by wait_min
perf tools: Version incorrect with some versions of grep
perf evlist: New command to list the names of events present in a perf.data file
perf script: Add support for H/W and S/W events
perf script: Add support for dumping symbols
perf script: Support custom field selection for output
perf script: Move printing of 'common' data from print_event and rename
perf tracing: Remove print_graph_cpu and print_graph_proc from trace-event-parse
perf script: Change process_event prototype
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
Make sure the 'match' variable always has a value.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The debugfs interface for branch profiling is through
/sys/kernel/debug/tracing/trace_stat/branch_annotated
/sys/kernel/debug/tracing/trace_stat/branch_all
so update the Kconfig accordingly.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <alpine.DEB.2.00.1103161716320.11407@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In blk_add_trace_rq, we only chose the minor 2 bits from
request's cmd_flags and did some check for discard.
so most of other flags(e.g, REQ_SYNC) are missing.
For example, with a sync write after blkparse we get:
8,16 1 1 0.001776503 7509 A WS 1349632 + 1024 <- (8,17) 1347584
8,16 1 2 0.001776813 7509 Q WS 1349632 + 1024 [dd]
8,16 1 3 0.001780395 7509 G WS 1349632 + 1024 [dd]
8,16 1 5 0.001783186 7509 I W 1349632 + 1024 [dd]
8,16 1 11 0.001816987 7509 D W 1349632 + 1024 [dd]
8,16 0 2 0.006218192 0 C W 1349632 + 1024 [0]
Since now we have integrated the flags of both bio and request,
it is safe to pass rq->cmd_flags directly to __blk_add_trace.
With this patch, after a sync write we get:
8,16 1 1 0.001776900 5425 A WS 1189888 + 1024 <- (8,17) 1187840
8,16 1 2 0.001777179 5425 Q WS 1189888 + 1024 [dd]
8,16 1 3 0.001780797 5425 G WS 1189888 + 1024 [dd]
8,16 1 5 0.001783402 5425 I WS 1189888 + 1024 [dd]
8,16 1 11 0.001817468 5425 D WS 1189888 + 1024 [dd]
8,16 0 2 0.005640709 0 C WS 1189888 + 1024 [0]
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the kernel command line declares a tracer "ftrace=sometracer" and
that tracer is either not defined or is enabled after irqsoff,
then the irqs off selftest will fail with the following error:
Testing tracer irqsoff:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/autotest/nobackup/linux-test.git/kernel/trace/tra
ce.c:713 update_max_tr_single+0xfa/0x11b()
Hardware name:
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.38-rc8-test #1
Call Trace:
[<c0441d9d>] ? warn_slowpath_common+0x65/0x7a
[<c049adb2>] ? update_max_tr_single+0xfa/0x11b
[<c0441dc1>] ? warn_slowpath_null+0xf/0x13
[<c049adb2>] ? update_max_tr_single+0xfa/0x11b
[<c049e454>] ? stop_critical_timing+0x154/0x204
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049e529>] ? time_hardirqs_on+0x25/0x28
[<c0468bca>] ? trace_hardirqs_on_caller+0x18/0x12f
[<c0468cec>] ? trace_hardirqs_on+0xb/0xd
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049b6b8>] ? register_tracer+0xf8/0x1a3
[<c14e93fe>] ? init_irqsoff_tracer+0xd/0x11
[<c040115e>] ? do_one_initcall+0x71/0x121
[<c14e93f1>] ? init_irqsoff_tracer+0x0/0x11
[<c14ce3a9>] ? kernel_init+0x13a/0x1b6
[<c14ce26f>] ? kernel_init+0x0/0x1b6
[<c0403842>] ? kernel_thread_helper+0x6/0x10
---[ end trace e93713a9d40cd06c ]---
.. no entries found ..FAILED!
What happens is the "ftrace=..." will expand the ring buffer to its
default size (from its minimum size) but it will not expand the
max ring buffer (the ring buffer to store maximum latencies).
When the irqsoff test runs, it will call the ring buffer swap routine
that checks if the max ring buffer is the same size as the normal
ring buffer, and will fail if it is not. This causes the test to fail.
The solution is to expand the max ring buffer before running the self
test if the max ring buffer is used by that tracer and the normal ring
buffer is expanded. The max ring buffer should be shrunk again after
the test is done to save space.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trace events belonging to a module only exists when the module is
loaded. Well, we can use trace_set_clr_event funtion to enable some
trace event at the module init routine, so that we will not miss
something while loading then module.
So, Export the trace_set_clr_event function so that module can use it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
LKML-Reference: <1289196312-25323-1-git-send-email-yuanhan.liu@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "Delta way too big" warning might appear on a system with a
unstable shed clock right after the system is resumed and tracing
was enabled at time of suspend.
Since it's not realy a bug, and the unstable sched clock is working
fast and reliable otherwise, Steven suggested to keep using the
sched clock in any case and just to make note in the warning itself.
v2 changes:
- added #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110218145219.GD2604@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Formatting change only to improve code readability. No code changes except to
introduce intermediate variables.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-13-git-send-email-dhsharp@google.com>
[ Keep variable declarations and assignment separate ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-6-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The lock_depth field in the event headers was added as a temporary
data point for help in removing the BKL. Now that the BKL is pretty
much been removed, we can remove this field.
This in turn changes the header from 12 bytes to 8 bytes,
removing the 4 byte buffer that gcc would insert if the first field
in the data load was 8 bytes in size.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-3-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add an "overwrite" trace_option for ftrace to control whether the buffer should
be overwritten on overflow or not. The default remains to overwrite old events
when the buffer is full. This patch adds the option to instead discard newest
events when the buffer is full. This is useful to get a snapshot of traces just
after enabling traces. Dropping the current event is also a simpler code path.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291844807-15481-1-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we enable trace events to trace block actions, We use
blk_fill_rwbs_rq to analyze the corresponding actions
in request's cmd_flags, but we only choose the minor 2 bits
from it, so most of other flags(e.g, REQ_SYNC) are missing.
For example, with a sync write we get:
write_test-2409 [001] 160.013869: block_rq_insert: 3,64 W 0 () 258135 + =
8 [write_test]
Since now we have integrated the flags of both bio and request,
it is safe to pass rq->cmd_flags directly to blk_fill_rwbs and
blk_fill_rwbs_rq isn't needed any more.
With this patch, after a sync write we get:
write_test-2417 [000] 226.603878: block_rq_insert: 3,64 WS 0 () 258135 +=
8 [write_test]
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This reverts commit 5e38ca8f3e.
Breaks the build of several !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
architectures.
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Message-ID: <20110217171823.GB17058@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add NULL check for avoiding NULL pointer deref.
This bug has been introduced by:
1ff511e35ed8: tracing/kprobes: Add bitfield type
which causes a null pointer dereference bug when kprobe-tracer
parses an argument without type.
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20110214054807.8919.69740.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
When the fuction graph tracer starts, it needs to make a special
stack for each task to save the real return values of the tasks.
All running tasks have this stack created, as well as any new
tasks.
On CPU hot plug, the new idle task will allocate a stack as well
when init_idle() is called. The problem is that cpu hotplug does
not create a new idle_task. Instead it uses the idle task that
existed when the cpu went down.
ftrace_graph_init_task() will add a new ret_stack to the task
that is given to it. Because a clone will make the task
have a stack of its parent it does not check if the task's
ret_stack is already NULL or not. When the CPU hotplug code
starts a CPU up again, it will allocate a new stack even
though one already existed for it.
The solution is to treat the idle_task specially. In fact, the
function_graph code already does, just not at init_idle().
Instead of using the ftrace_graph_init_task() for the idle task,
which that function expects the task to be a clone, have a
separate ftrace_graph_init_idle_task(). Also, we will create a
per_cpu ret_stack that is used by the idle task. When we call
ftrace_graph_init_idle_task() it will check if the idle task's
ret_stack is NULL, if it is, then it will assign it the per_cpu
ret_stack.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cdrom: support devices that have check_events but not media_changed
cfq-iosched: Don't wait if queue already has requests.
blkio-throttle: Avoid calling blkiocg_lookup_group() for root group
cfq: rename a function to give it more appropriate name
cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily.
drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y
loop: queue_lock NULL pointer derefence in blk_throtl_exit
drivers/block/Makefile: replace the use of <module>-objs with <module>-y
blktrace: Don't output messages if NOTIFY isn't set.
tracing_enabled should not be used, it is heavy weight and does not
do much in helping lower the overhead.
tracing_on should be used instead. Warn users to use tracing_on
when tracing_enabled is used as it will soon be removed from the
tracing directory.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace events sched_switch and sched_wakeup do the same thing
as the stand alone sched_switch tracer does. It is no longer needed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The warning "Delta way too big" warning might appear on a system with
unstable shed clock right after the system is resumed and tracing
was enabled during the suspend.
Since it's not realy bug, and the unstable sched clock is working
fast and reliable otherwise, Steven suggested to keep using the
sched clock in any case and just to make note in the warning itself.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1296649698-6003-1-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Many system calls are unimplemented and mapped to sys_ni_syscall, but at
boot ftrace would still search through every syscall metadata entry for
a match which wouldn't be there.
This patch adds causes the search to terminate early if the system call
is not mapped.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-7-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some architectures have unusual symbol names and the generic code to
match the symbol name with the function name for the syscall metadata
will fail. For example, symbols on PPC64 start with a period and the
generic code will fail to match them.
This patch moves the match logic out into a separate function which an
arch can override by defining ARCH_HAS_SYSCALL_MATCH_SYM_NAME in
asm/ftrace.h and implementing arch_syscall_match_sym_name.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-5-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some architectures use non-trivial system call tables and will not work
with the generic arch_syscall_addr code. For example, PowerPC64 uses a
table of twin long longs.
This patch makes the generic arch_syscall_addr weak to allow
architectures with non-trivial system call tables to override it.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-4-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the ftrace events now checking if the syscall_nr is valid upon
initialisation it should no longer be possible to register or unregister
a syscall event without a valid syscall_nr since they should not be
created. This adds a WARN_ON_ONCE in the register and unregister
functions to locate potential regressions in the future.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-3-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
FTRACE_SYSCALLS would create events for each and every system call, even
if it had failed to map the system call's name with it's number. This
resulted in a number of events being created that would not behave as
expected.
This could happen, for example, on architectures who's symbol names are
unusual and will not match the system call name. It could also happen
with system calls which were mapped to sys_ni_syscall.
This patch changes the default system call number in the metadata to -1.
If the system call name from the metadata is not successfully mapped to
a system call number during boot, than the event initialisation routine
will now return an error, preventing the event from being created.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-2-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because the filters are processed first and then activated
(added to the call), we no longer need to worry about the preds
of the filter in __alloc_preds() being used. As the filter that
is allocating preds is not activated yet.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When creating a new filter, instead of allocating the filter to the
event call first and then processing the filter, it is easier to
process a temporary filter and then just swap it with the call filter.
By doing this, it simplifies the code.
A filter is allocated and processed, when it is done, it is
swapped with the call filter, synchronize_sched() is called to make
sure all callers are done with the old filter (filters are called
with premption disabled), and then the old filter is freed.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that the filter logic does not require to save the pred results
on the stack, we can increase the max number of preds we allow.
As the preds are index by a short value, and we use the MSBs as flags
we can increase the max preds to 2^14 (16384) which should be way
more than enough.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The MAX_FILTER_PRED is only needed by the kernel/trace/*.c files.
Move it to kernel/trace/trace.h.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are many cases that a filter will contain multiple ORs or
ANDs together near the leafs. Walking up and down the tree to get
to the next compare can be a waste.
If there are several ORs or ANDs together, fold them into a single
pred and allocate an array of the conditions that they check.
This will speed up the filter by linearly walking an array
and can still break out if a short circuit condition is met.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the filter walks a tree to determine if a match is made or not,
if the tree was incorrectly created, it could cause an infinite loop.
Add a check to walk the entire tree before assigning it as a filter
to make sure the tree is correct.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The test if we should break out early for OR and AND operations
can be optimized by comparing the current result with
(pred->op == OP_OR)
That is if the result is true and the op is an OP_OR, or
if the result is false and the op is not an OP_OR (thus an OP_AND)
we can break out early in either case. Otherwise we continue
processing.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the filter_match_preds() requires a stack to push
and pop the preds to determine if the filter matches the record or not.
This has two drawbacks:
1) It requires a stack to store state information. As this is done
in fast paths we can't allocate the storage for this stack, and
we can't use a global as it must be re-entrant. The stack is stored
on the kernel stack and this greatly limits how many preds we
may allow.
2) All conditions are calculated even when a short circuit exists.
a || b will always calculate a and b even though a was determined
to be true.
Using a tree we can walk a constant structure that will save
the state as we go. The algorithm is simply:
pred = root;
do {
switch (move) {
case MOVE_DOWN:
if (OR or AND) {
pred = left;
continue;
}
if (pred == root)
break;
match = pred->fn();
pred = pred->parent;
move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT;
continue;
case MOVE_UP_FROM_LEFT:
/* Only OR or AND can be a parent */
if (match && OR || !match && AND) {
/* short circuit */
if (pred == root)
break;
pred = pred->parent;
move = left child ?
MOVE_UP_FROM_LEFT :
MOVE_UP_FROM_RIGHT;
continue;
}
pred = pred->right;
move = MOVE_DOWN;
continue;
case MOVE_UP_FROM_RIGHT:
if (pred == root)
break;
pred = pred->parent;
move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT;
continue;
}
done = 1;
} while (!done);
This way there's no strict limit to how many preds we allow
and it also will short circuit the logical operations when possible.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently we allocate an array of pointers to filter_preds, and then
allocate a separate filter_pred for each item in the array.
This adds slight overhead in the filters as it needs to derefernce
twice to get to the op condition.
Allocating the preds themselves in a single array removes a dereference
as well as helps on the cache footprint.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By separating out the reseting of the filter->n_preds to zero from
the reallocation of preds for the filter, we can reset groups of
filters first, call synchronize_sched() just once, and then reallocate
each of the filters in the system group.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For every filter that is made, we create predicates to hold every
operation within the filter. We have a max of 32 predicates that we
can hold. Currently, we allocate all 32 even if we only need to
use one.
Part of the reason we do this is that the filter can be used at
any moment by any event. Fortunately, the filter is only used
with preemption disabled. By reseting the count of preds used "n_preds"
to zero, then performing a synchronize_sched(), we can safely
free and reallocate a new array of preds.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ops OR and AND act different from the other ops, as they
are the only ones to take other ops as their arguements.
These ops als change the logic of the filter_match_preds.
By removing the OR and AND fn's we can also remove the val1 and val2
that is passed to all other fn's and are unused.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The n_preds field of a file can change at anytime, and even can become
zero, just as the filter is about to be processed by an event.
In the case that is zero on entering the filter, return 1, telling
the caller the event matchs and should be trace.
Also use a variable and assign it with ACCESS_ONCE() such that the
count stays consistent within the function.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add bitfield type for tracing arguments on kprobe-tracer. The syntax of
a bitfield type is:
b<bit-size>@<bit-offset>/<container-size>
e.g.
Accessing 2 bits-width field with 4 bits-offset in 32 bits-width data at
4 bytes offseted from the address pointed by AX register:
+4(%ax):b2@4/32
Since the width of container data depends on the arch, so I just added
the container-size at the end.
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20110204125205.9507.11363.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since strict_strtol() accepts minus digits started with '-', it doesn't
need to invert after converting.
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20110204125153.9507.49335.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently the syscall_meta structures for the syscall tracepoints are
placed in the __syscall_metadata section, and at link time, the linker
makes one large array of all these syscall metadata structures. On boot
up, this array is read (much like the initcall sections) and the syscall
data is processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are suppose to be in an array.
A hack was used previous to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the __syscall_metadata section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The __syscall_metadata section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace_event structures are placed in the _ftrace_events
section, and at link time, the linker makes one large array of all
the trace_event structures. On boot up, this array is read (much like
the initcall sections) and the events are processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are suppose to be in an array.
A hack was used previous to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the _ftrace_event section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The _ftrace_event section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
During early boot, local IRQ is disabled until IRQ subsystem is
properly initialized. During this time, no one should enable
local IRQ and some operations which usually are not allowed with
IRQ disabled, e.g. operations which might sleep or require
communications with other processors, are allowed.
lockdep tracked this with early_boot_irqs_off/on() callbacks.
As other subsystems need this information too, move it to
init/main.c and make it generally available. While at it,
toggle the boolean to early_boot_irqs_disabled instead of
enabled so that it can be initialized with %false and %true
indicates the exceptional condition.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now if we enable blktrace, cfq has too many messages output to the
trace buffer. It is fine if we don't specify any action mask.
But if I do like this:
blktrace /dev/sdb -a issue -a complete -o - | blkparse -i -
I only want to see 'D' and 'C', while with the following command
dd if=/mnt/ocfs2/test of=/dev/null bs=4k count=1 iflag=direct
I will get(with a 2.6.37 vanilla kernel):
8,16 0 0 0.000000000 0 m N cfq3805 alloced
8,16 0 0 0.000004126 0 m N cfq3805 insert_request
8,16 0 0 0.000004884 0 m N cfq3805 add_to_rr
8,16 0 0 0.000008417 0 m N cfq workload slice:300
8,16 0 0 0.000009557 0 m N cfq3805 set_active wl_prio:0 wl_type:2
8,16 0 0 0.000010640 0 m N cfq3805 fifo= (null)
8,16 0 0 0.000011193 0 m N cfq3805 dispatch_insert
8,16 0 0 0.000012221 0 m N cfq3805 dispatched a request
8,16 0 0 0.000012802 0 m N cfq3805 activate rq, drv=1
8,16 0 1 0.000013181 3805 D R 114759 + 8 [dd]
8,16 0 2 0.000164244 0 C R 114759 + 8 [0]
8,16 0 0 0.000167997 0 m N cfq3805 complete rqnoidle 0
8,16 0 0 0.000168782 0 m N cfq3805 set_slice=100
8,16 0 0 0.000169874 0 m N cfq3805 arm_idle: 8 group_idle: 0
8,16 0 0 0.000170189 0 m N cfq schedule dispatch
8,16 0 0 0.000397938 0 m N cfq3805 slice expired t=0
8,16 0 0 0.000399763 0 m N cfq3805 sl_used=1 disp=1 charge=1 iops=0 sect=8
8,16 0 0 0.000400227 0 m N cfq3805 del_from_rr
8,16 0 0 0.000400882 0 m N cfq3805 put_queue
See, there are 19 lines while I only need 2. I don't think it is
appropriate for a user.
So this patch will disable any messages if the BLK_TC_NOTIFY isn't set.
Now the output for the same command will look like:
8,16 0 1 0.000000000 4908 D R 114759 + 8 [dd]
8,16 0 2 0.000146827 0 C R 114759 + 8 [0]
Yes, it is what I want to see.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fix a bunch of
warning: ‘inline’ is not at beginning of declaration
messages when building a 'make allyesconfig' kernel with -Wextra.
These warnings are trivial to kill, yet rather annoying when building with
-Wextra.
The more we can cut down on pointless crap like this the better (IMHO).
A previous patch to do this for a 'allnoconfig' build has already been
merged. This just takes the cleanup a little further.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: avoid pointless blocked-task warnings
rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status
rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi()
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, olpc: Add missing Kconfig dependencies
x86, mrst: Set correct APB timer IRQ affinity for secondary cpu
x86: tsc: Fix calibration refinement conditionals to avoid divide by zero
x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timekeeping: Make local variables static
time: Rename misnamed minsec argument of clocks_calc_mult_shift()
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Remove syscall_exit_fields
tracing: Only process module tracepoints once
perf record: Add "nodelay" mode, disabled by default
perf sched: Fix list of events, dropping unsupported ':r' modifier
Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return"
perf top: Fix annotate segv
perf evsel: Fix order of event list deletion
There is no need for syscall_exit_fields as the syscall
exit event class can already host the fields in its structure,
like most other trace events do by default. Use that
default behavior instead.
Following this scheme, we don't need anymore to override the
get_fields() callback of the syscall exit event class either.
Hence both syscall_exit_fields and syscall_get_exit_fields() can
be removed.
Also changed some indentation to keep the following under 80
characters:
".fields = LIST_HEAD_INIT(event_class_syscall_exit.fields),"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4D301C0E.8090408@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
We normally just use the BIO_UPTODATE flag to signal 0/-EIO. If
we have more information available, we should pass that along to
the trace output.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
DEFINE_TRACE should also exist when CONFIG_EVENT_TRACING=n. Otherwise, setting
only TRACEPOINTS=y is broken.
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20101028153117.GA4051@Krystal>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While running my ftrace stress test, this showed up:
BUG: sleeping function called from invalid context at mm/mmap.c:233
...
note: cat[3293] exited with preempt_count 1
The bug was introduced by commit 91e86e560d
("tracing: Fix recursive user stack trace")
Cc: <stable@kernel.org>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4D0089AC.1020802@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Function-scope statics are discouraged because they are
easily overlooked and can cause subtle bugs/races due to
their global (non-SMP safe) nature.
Linus noticed that we did this for sched_param - at minimum
make the const.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: Message-ID: <AANLkTinotRxScOHEb0HgFgSpGPkq_6jKTv5CfvnQM=ee@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
blktrace.c block bio complete callback needs to gain a new argument to reflect
the newly added "error" tracepoint argument. This is needed to match the new
block_bio_complete TRACE_EVENT as of
commit de983a7bfcb7c020901ca6e2314cf55a4207ab5a.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Jeff Moyer <jmoyer@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
sched: Change wait_for_completion_*_timeout() to return a signed long
sched, autogroup: Fix reference leak
sched, autogroup: Fix potential access to freed memory
sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
sched: Move periodic share updates to entity_tick()
printk: Use this_cpu_{read|write} api on printk_pending
sched: Make pushable_tasks CONFIG_SMP dependant
sched: Add 'autogroup' scheduling feature: automated per session task groups
sched: Fix unregister_fair_sched_group()
sched: Remove unused argument dest_cpu to migrate_task()
mutexes, sched: Introduce arch_mutex_cpu_relax()
sched: Add some clock info to sched_debug
cpu: Remove incorrect BUG_ON
cpu: Remove unused variable
sched: Fix UP build breakage
sched: Make task dump print all 15 chars of proc comm
sched: Update tg->shares after cpu.shares write
sched: Allow update_cfs_load() to update global load
sched: Implement demand based update_cfs_load()
...
Add these new power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend
The old C-state/idle accounting events:
power:power_start
power:power_end
Have now a replacement (but we are still keeping the old
tracepoints for compatibility):
power:cpu_idle
and
power:power_frequency
is replaced with:
power:cpu_frequency
power:machine_suspend is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.
the type= field got removed from both, it was never
used and the type is differed by the event type itself.
perf timechart userspace tool gets adjusted in a separate patch.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-3-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
power_frequency moved to drivers/cpufreq/cpufreq.c which has
to be compiled in, no need to export it.
intel_idle can a be module though...
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Jean Pihet <j-pihet@ti.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-2-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
Fix two related problems in the event-copying loop of
ring_buffer_read_page.
The loop condition for copying events is off-by-one.
"len" is the remaining space in the caller-supplied page.
"size" is the size of the next event (or two events).
If len == size, then there is just enough space for the next event.
size was set to rb_event_ts_length, which may include the size of two
events if the first event is a time-extend, in order to assure time-
extends are kept together with the event after it. However,
rb_advance_reader always advances by one event. This would result in the
event after any time-extend being duplicated. Instead, get the size of
a single event for the memcpy, but use rb_event_ts_length for the loop
condition.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1293064704-8101-1-git-send-email-dhsharp@google.com>
LKML-Reference: <AANLkTin7nLrRPc9qGjdjHbeVDDWiJjAiYyb-L=gH85bx@mail.gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
The file_ops struct for the "trace" special file defined llseek as seq_lseek().
However, if the file was opened for writing only, seq_open() was not called,
and the seek would dereference a null pointer, file->private_data.
This patch introduces a new wrapper for seq_lseek() which checks if the file
descriptor is opened for reading first. If not, it does nothing.
Cc: <stable@kernel.org>
Signed-off-by: Slava Pestov <slavapestov@google.com>
LKML-Reference: <1290640396-24179-1-git-send-email-slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Remove incorrect open-coded container_of()
perf record: Handle restrictive permissions in /proc/{kallsyms,modules}
x86/kprobes: Prevent kprobes to probe on save_args()
irq_work: Drop cmpxchg() result
perf: Fix owner-list vs exit
x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG
tracing: Fix recursive user stack trace
perf,hw_breakpoint: Initialize hardware api earlier
x86: Ignore trap bits on single step exceptions
tracing: Force arch_local_irq_* notrace for paravirt
tracing: Fix module use of trace_bprintk()
Currently we have in something like the sched_switch event:
field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
When a userspace tool such as perf tries to parse this, the
TASK_COMM_LEN is meaningless. This is done because the TRACE_EVENT() macro
simply uses a #len to show the string of the length. When the length is
an enum, we get a string that means nothing for tools.
By adding a static buffer and a mutex to protect it, we can store the
string into that buffer with snprintf and show the actual number.
Now we get:
field:char prev_comm[16]; offset:12; size:16; signed:1;
Something much more useful.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This adds a new trace event internal flag that allows them to be
used in perf by non privileged users in case of task bound tracing.
This is desired for syscalls tracepoint because they don't leak
global system informations, like some other tracepoints.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jason Baron <jbaron@redhat.com>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] kprobes: Fix the return address of multiple kretprobes
[S390] kprobes: disable interrupts throughout
[S390] ftrace: build without frame pointers on s390
[S390] mm: add devmem_is_allowed() for STRICT_DEVMEM checking
[S390] vmlogrdr: purge after recording is switched off
[S390] cio: fix incorrect ccw_device_init_count
[S390] tape: add medium state notifications
[S390] fix get_user_pages_fast
The user stack trace can fault when examining the trace. Which
would call the do_page_fault handler, which would trace again,
which would do the user stack trace, which would fault and call
do_page_fault again ...
Thus this is causing a recursive bug. We need to have a recursion
detector here.
[ Resubmitted by Jiri Olsa ]
[ Eric Dumazet recommended using __this_cpu_* instead of __get_cpu_* ]
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1289390172-9730-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
at this point is:
- various checks inside the block layer.
- sanity checks in bio based drivers.
- now unused bio_empty_barrier helper.
- Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
but Xen really needs to sort out it's barrier situaton.
- setting of ordered tags in uas - dead code copied from old scsi
drivers.
- scsi different retry for barriers - it's dead and should have been
removed when flushes were converted to FS requests.
- blktrace handling of barriers - removed. Someone who knows blktrace
better should add support for REQ_FLUSH and REQ_FUA, though.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
s390 doesn't need FRAME_POINTERS in order to have a working function tracer.
We don't need frame pointers in order to get strack traces since we always
have valid backchains by using the -mkernel-backchain gcc option.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Andrew Morton pointed out almost all sched_setscheduler() callers are
using fixed parameters and can be converted to static. It reduces runtime
memory use a little.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that include/linux/kdb.h properly exports all the functions
required to dynamically add a kdb shell command, the reference to the
private kdb header can be removed.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
* 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
BKL: introduce CONFIG_BKL.
dabusb: remove the BKL
sunrpc: remove the big kernel lock
init/main.c: remove BKL notations
blktrace: remove the big kernel lock
rtmutex-tester: make it build without BKL
dvb-core: kill the big kernel lock
dvb/bt8xx: kill the big kernel lock
tlclk: remove big kernel lock
fix rawctl compat ioctls breakage on amd64 and itanic
uml: kill big kernel lock
parisc: remove big kernel lock
cris: autoconvert trivial BKL users
alpha: kill big kernel lock
isapnp: BKL removal
s390/block: kill the big kernel lock
hpet: kill BKL, add compat_ioctl
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (278 commits)
arm: remove machine_desc.io_pg_offst and .phys_io
arm: use addruart macro to establish debug mappings
arm: return both physical and virtual addresses from addruart
arm/debug: consolidate addruart macros for CONFIG_DEBUG_ICEDCC
ARM: make struct machine_desc definition coherent with its comment
eukrea_mbimxsd-baseboard: Pass the correct GPIO to gpio_free
cpuimx27: fix compile when ULPI is selected
mach-pcm037_eet: fix compile errors
Fixing ethernet driver compilation error for i.MX31 ADS board
cpuimx51: update board support
mx5: add cpuimx51sd module and its baseboard
iomux-mx51: fix GPIO_1_xx 's IOMUX configuration
imx-esdhc: update devices registration
mx51: add resources for SD/MMC on i.MX51
iomux-mx51: fix SD1 and SD2's iomux configuration
clock-mx51: rename CLOCK1 to CLOCK_CCGR for better readability
clock-mx51: factorize clk_set_parent and clk_get_rate
eukrea_mbimxsd: add support for DVI displays
cpuimx25 & cpuimx35: fix OTG port registration in host mode
i.MX31 and i.MX35 : fix errate TLSbo65953 and ENGcm09472
...
The tracing per_cpu buffers were limited to 999 CPUs for a mear
savings in stack space of a char array. Up the array to 30 characters
which is more than enough to hold a 64 bit number.
Reported-by: Robin Holt <holt@sgi.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the binding of time extends to events we no longer need to use
the macro RB_TIMESTAMPS_PER_PAGE. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a condition to check if we should add a time extend or
not in the fast path. But this condition is racey (in the sense
that we can add a unnecessary time extend, but nothing that
can break anything). We later check if the time or event time
delta should be zero or have real data in it (not racey), making
this first check redundant.
This check may help save space once in a while, but really is
not worth the hassle to try to save some space that happens at
most 134 ms at a time.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the time between two timestamps is greater than
2^27 nanosecs (~134 ms) a time extend event is added that extends
the time difference to 59 bits (~18 years). This is due to
events only having a 27 bit field to store time.
Currently this time extend is a separate event. We add it just before
the event data that is being written to the buffer. But before
the event data is committed, the event data can also be discarded (as
with the case of filters). But because the time extend has already been
committed, it will stay in the buffer.
If lots of events are being filtered and no event is being
written, then every 134ms a time extend can be added to the buffer
without any data attached. To keep from filling the entire buffer
with time extends, a time extend will never be the first event
in a page because the page timestamp can be used. Time extends can
only fill the rest of a page with some data at the beginning.
This patch binds the time extend with the data. The difference here
is that the time extend is not committed before the data is added.
Instead, when a time extend is needed, the space reserved on
the ring buffer is the time extend + the data event size. The
time extend is added to the first part of the reserved block and
the data is added to the second. The time extend event is passed
back to the reserver, but since the reserver also uses a function
to find the data portion of the reserved block, no changes to the
ring buffer interface need to be made.
When a commit is discarded, we now remove both the time extend and
the event. With this approach no more than one time extend can
be in the buffer in a row. Data must always follow a time extend.
Thanks to Mathieu Desnoyers for suggesting this idea.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The delta between events is passed to the timestamp code by reference
and the timestamp code will reset the value. But it can be reset
from the caller. No need to pass it in by reference.
By changing the call to pass by value, lets gcc optimize the code
a bit more where it can store the delta in a register and not
worry about updating the reference.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original code for the ring buffer had locations that modified
the timestamp and that change was used by the callers. Now,
the timestamp is not reused by the callers and there is no reason
to pass it by reference.
By changing the call to pass by value, lets gcc optimize the code
a bit more where it can store the timestamp in a register and not
worry about updating the reference.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Gcc inlines the slow path of the ring buffer write which can
hurt performance. This patch simply forces the slow path function
rb_move_tail() to always be a function.
The ring_buffer_benchmark module with reader_disabled=1 shows that
this patch changes the time to record an event from 135 ns to
132 ns. (3 ns or 2.22% improvement)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function start_func_tracer() was incorrectly added in the
#ifdef CONFIG_FUNCTION_TRACER condition, but is still used even
when function tracing is not enabled.
The calls to register_ftrace_function() and register_ftrace_graph()
become nops (and their arguments are even ignored), thus there is
no reason to hide start_func_tracer() when function tracing is
not enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to Jens, this code does not need the BKL at all,
it is sufficiently serialized by bd_mutex.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Even though the parent is recorded with the normal function tracing
of the latency tracers (irqsoff and wakeup), the function graph
recording is bogus.
This is due to the function graph messing with the return stack.
The latency tracers pass in as the parent CALLER_ADDR0, which
works fine for plain function tracing. But this causes bogus output
with the graph tracer:
3) <idle>-0 | d.s3. 0.000 us | return_to_handler();
3) <idle>-0 | d.s3. 0.000 us | _raw_spin_unlock_irqrestore();
3) <idle>-0 | d.s3. 0.000 us | return_to_handler();
3) <idle>-0 | d.s3. 0.000 us | trace_hardirqs_on();
The "return_to_handle()" call is the trampoline of the
function graph tracer, and is meaningless in this context.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The preempt and irqsoff tracers have three types of function tracers.
Normal function tracer, function graph entry, and function graph return.
Each of these use a complex dance to prevent recursion and whether
to trace the data or not (depending if interrupts are enabled or not).
This patch moves the duplicate code into a single routine, to
prevent future mistakes with modifying duplicate complex code.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The wakeup tracer has three types of function tracers. Normal
function tracer, function graph entry, and function graph return.
Each of these use a complex dance to prevent recursion and whether
to trace the data or not (depending on the wake_task variable).
This patch moves the duplicate code into a single routine, to
prevent future mistakes with modifying duplicate complex code.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add function graph support for wakeup latency tracer.
The graph output is enabled by setting the 'display-graph'
trace option.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1285243253-7372-4-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move trace_graph_function() and print_graph_headers_flags() functions
to the trace_function_graph.c to be globaly available.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1285243253-7372-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The check_irq_entry and check_irq_return could be called
from graph event context. In such case there's no graph
private data allocated. Adding checks to handle this case.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20100924154102.GB1818@jolsa.brq.redhat.com>
[ Fixed some grammar in the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
The config option used by archs to let the build system know that
the C version of the recordmcount works for said arch is currently
called HAVE_C_MCOUNT_RECORD which enables BUILD_C_RECORDMCOUNT. To
be more consistent with the name that all archs may use, it has been
renamed to HAVE_C_RECORDMCOUNT. This will be less confusing since
we are building a C recordmcount and not a mcount_record.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds the support for the C version of recordmcount and
compile times show ~ 12% improvement.
After verifying this works, other archs can add:
HAVE_C_MCOUNT_RECORD
in its Kconfig and it will use the C version of recordmcount
instead of the perl version.
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix
kernel/trace/trace_functions_graph.c: In function ‘trace_print_graph_duration’:
kernel/trace/trace_functions_graph.c:652: warning: comparison of distinct pointer types lacks a cast
when building 36-rc6 on a 32-bit due to the strict type check failing
in the min() macro.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Chase Douglas <chase.douglas@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20100929080823.GA13595@liondog.tnic>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Time stamps for the ring buffer are created by the difference between
two events. Each page of the ring buffer holds a full 64 bit timestamp.
Each event has a 27 bit delta stamp from the last event. The unit of time
is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
happen more than 134 milliseconds apart, a time extend is inserted
to add more bits for the delta. The time extend has 59 bits, which
is good for ~18 years.
Currently the time extend is committed separately from the event.
If an event is discarded before it is committed, due to filtering,
the time extend still exists. If all events are being filtered, then
after ~134 milliseconds a new time extend will be added to the buffer.
This can only happen till the end of the page. Since each page holds
a full timestamp, there is no reason to add a time extend to the
beginning of a page. Time extends can only fill a page that has actual
data at the beginning, so there is no fear that time extends will fill
more than a page without any data.
When reading an event, a loop is made to skip over time extends
since they are only used to maintain the time stamp and are never
given to the caller. As a paranoid check to prevent the loop running
forever, with the knowledge that time extends may only fill a page,
a check is made that tests the iteration of the loop, and if the
iteration is more than the number of time extends that can fit in a page
a warning is printed and the ring buffer is disabled (all of ftrace
is also disabled with it).
There is another event type that is called a TIMESTAMP which can
hold 64 bits of data in the theoretical case that two events happen
18 years apart. This code has not been implemented, but the name
of this event exists, as well as the structure for it. The
size of a TIMESTAMP is 16 bytes, where as a time extend is only
8 bytes. The macro used to calculate how many time extends can fit on
a page used the TIMESTAMP size instead of the time extend size
cutting the amount in half.
The following test case can easily trigger the warning since we only
need to have half the page filled with time extends to trigger the
warning:
# cd /sys/kernel/debug/tracing/
# echo function > current_tracer
# echo 'common_pid < 0' > events/ftrace/function/filter
# echo > trace
# echo 1 > trace_marker
# sleep 120
# cat trace
Enabling the function tracer and then setting the filter to only trace
functions where the process id is negative (no events), then clearing
the trace buffer to ensure that we have nothing in the buffer,
then write to trace_marker to add an event to the beginning of a page,
sleep for 2 minutes (only 35 seconds is probably needed, but this
guarantees the bug), and then finally reading the trace which will
trigger the bug.
This patch fixes the typo and prevents the false positive of that warning.
Reported-by: Hans J. Koch <hjk@linutronix.de>
Tested-by: Hans J. Koch <hjk@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Initialize the workqueue data structures *before* they are registered
so that they are ready for callbacks.
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <e3a3383fc370ac7086625bebe89d9480d7caf372.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enums for FTRACE_ENABLE_MCOUNT and FTRACE_DISABLE_MCOUNT were
used as commands to ftrace_run_update_code(). But these commands
were used by the old nasty ftrace daemon that has long been slain.
This is a clean up patch to remove the references to these enums
and simplify the code a little.
Reported-by: Wu Zhangjin <wuzhangjin@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function graph tracer funcgraph-irq option is zero, disable
tracing in IRQs. This makes the option have two effects.
1) When reading the trace file, do not display the functions that
happen in interrupt context (when detected)
2) [*new*] When recording a trace, skip those that are detected
to be in interrupt by the 'in_irq()' function
Note, in_irq() is updated at irq_enter() and irq_exit(). There are
still functions that are recorded by the function graph tracer that
is in interrupt context but outside the irq_enter/exit() routines.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's handy to be able to disable the irq related output
and not to have to jump over each irq related code, when
you have no interrest in it.
The option is by default enabled, so there's no change to
current behaviour. It affects only the final output, so all
the irq related data stay in the ring buffer.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20100907145344.GC1912@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we do:
# cd /sys/kernel/debug
# echo 'do_IRQ:traceon schedule:traceon sys_write:traceon' > \
set_ftrace_filter
# cat set_ftrace_filter
We get the following output:
#### all functions enabled ####
sys_write:traceon:unlimited
schedule:traceon:unlimited
do_IRQ:traceon:unlimited
This outputs two lists. One is the fact that all functions are
currently enabled for function tracing, the other has three probed
functions, which happen to have 'traceon' as their commands.
Currently, when reading the first list (functions enabled) the
seq_file code will receive a "NULL" from the t_next() function
causing it to exit early. This makes "read()" from userspace stop
reading the code at this boarder. Although read is allowed to do this,
some (broken) applications might consider this an end of file and
stop early.
This patch adds the start of the second list to t_next() when it
finishes the first list. It is a simple change and gives the
set_ftrace_filter file nicer reading ability.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch keeps track of the index within the elements of
set_ftrace_filter and if the position goes backwards, it nicely
resets and starts from the beginning again.
This allows for lseek and pread to work properly now.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The set_ftrace_filter uses seq_file and reads from two lists. The
pointer returned by t_next() can either be of type struct dyn_ftrace
or struct ftrace_func_probe. If there is a bug (there was one)
the wrong pointer may be used and the reference can cause an oops.
This patch makes t_next() and friends only return the iterator structure
which now has a pointer of type struct dyn_ftrace and struct
ftrace_func_probe. The t_show() can now test if the pointer is NULL or
not and if the pointer exists, it is guaranteed to be of the correct type.
Now if there's a bug, only wrong data will be shown but not an oops.
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the filtered functions are read, the probed functions are read
from the hash in set_ftrace_filter. When the hashed probed functions
are read, the *pos passed in is reset. Instead of modifying the pos
given to the read function, just record the pos where the filtered
functions ended and subtract from that.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread
perf symbols: Fix multiple initialization of symbol system
perf: Fix CPU hotplug
perf, trace: Fix module leak
tracing/kprobe: Fix handling of C-unlike argument names
tracing/kprobes: Fix handling of argument names
perf probe: Fix handling of arguments names
perf probe: Fix return probe support
tracing/kprobe: Fix a memory leak in error case
tracing: Do not allow llseek to set_ftrace_filter
Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without
having properly started the iterator to iterate the hash. This case is
degenerate and, as discovered by Robert Swiecki, can cause t_hash_show()
to misuse a pointer. This causes a NULL ptr deref with possible security
implications. Tracked as CVE-2010-3079.
Cc: Robert Swiecki <swiecki@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Replace pmu::{enable,disable,start,stop,unthrottle} with
pmu::{add,del,start,stop}, all of which take a flags argument.
The new interface extends the capability to stop a counter while
keeping it scheduled on the PMU. We replace the throttled state with
the generic stopped state.
This also allows us to efficiently stop/start counters over certain
code paths (like IRQ handlers).
It also allows scheduling a counter without it starting, allowing for
a generic frozen state (useful for rotating stopped counters).
The stopped state is implemented in two different ways, depending on
how the architecture implemented the throttled state:
1) We disable the counter:
a) the pmu has per-counter enable bits, we flip that
b) we program a NOP event, preserving the counter state
2) We store the counter state and ignore all read/overflow events
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus <paulus@samba.org>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Yanmin <yanmin_zhang@linux.intel.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Michael Cree <mcree@orcon.net.nz>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
gcc-4.6: kernel/*: Fix unused but set warnings
mutex: Fix annotations to include it in kernel-locking docbook
pid: make setpgid() system call use RCU read-side critical section
MAINTAINERS: Add RCU's public git tree
Check the argument name whether it is invalid (not C-like symbol name). This
makes event format simple.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113912.22882.62313.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Set "argN" name for each argument automatically if it has no specified name.
Since dynamic trace event(kprobe_events) accepts special characters for its
argument, its format can show those special characters (e.g. '$', '%', '+').
However, perf can't parse those format because of the character (especially
'%') mess up the format. This sets "argX" name for those arguments if user
omitted the argument names.
E.g.
# echo 'p do_fork %ax IP=%ip $stack' > tracing/kprobe_events
# cat tracing/kprobe_events
p:kprobes/p_do_fork_0 do_fork arg1=%ax IP=%ip arg3=$stack
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113906.22882.59312.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Fix a memory leak which happens when a field name conflicts with others. In
error case, free_trace_probe() will free all arguments until nr_args, so this
increments nr_args the begining of the loop instead of the end.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113846.22882.12670.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reading the file set_ftrace_filter does three things.
1) shows whether or not filters are set for the function tracer
2) shows what functions are set for the function tracer
3) shows what triggers are set on any functions
3 is independent from 1 and 2.
The way this file currently works is that it is a state machine,
and as you read it, it may change state. But this assumption breaks
when you use lseek() on the file. The state machine gets out of sync
and the t_show() may use the wrong pointer and cause a kernel oops.
Luckily, this will only kill the app that does the lseek, but the app
dies while holding a mutex. This prevents anyone else from using the
set_ftrace_filter file (or any other function tracing file for that matter).
A real fix for this is to rewrite the code, but that is too much for
a -rc release or stable. This patch simply disables llseek on the
set_ftrace_filter() file for now, and we can do the proper fix for the
next major release.
Reported-by: Robert Swiecki <swiecki@google.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: vendor-sec@lst.de
Cc: <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
No real bugs I believe, just some dead code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: andi@firstfloor.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With a new enough GCC, ARM function tracing can be supported without the
need for frame pointers. This is essential for Thumb-2 support, since
frame pointers aren't available then.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
While discussing the strictness of the 80 character limit on the
Kernel Summit Discussion mailing list, I showed examples that I
broke that limit slightly with some algorithms. In discussing with
John Linville, what looked better, I realized that two of the
80 char breaking culprits were an identical expression.
As a clean up, this patch moves the identical expression into its
own helper function and that is used instead. As a side effect,
the offending code is now under the 80 character limit. :-)
This clean up code also changes the expression from
(A - B) - C to A - (B + C)
This makes the code look a little nicer too.
Cc: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While we are reading trace_stat/functionX and someone just
disabled function_profile at that time, we can trigger this:
divide error: 0000 [#1] PREEMPT SMP
...
EIP is at function_stat_show+0x90/0x230
...
This fix just takes the ftrace_profile_lock and checks if
rec->counter is 0. If it's 0, we know the profile buffer
has been reset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: stable@kernel.org
LKML-Reference: <4C723644.4040708@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of hardcoding the number of contexts for the recursions
barriers, define a cpp constant to make the code more
self-explanatory.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Remove the nasty hack that marks a pointer's LSB to distinguish common
fields from event fields. Replace it with a more sane approach.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4C6A23C2.9020606@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Two new events were added that broke the current format output.
Both from the SCSI system: scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout
The reason is that their print_fmt exceeded a page size. Since the output
of the format used simple_read_from_buffer and trace_seq, it was limited
to a page size in output.
This patch converts the printing of the format of an event into seq_file,
which allows greater than a page size to be shown.
I diffed all event formats comparing the output with and without this
patch. All matched except for the above two, which showed just:
FORMAT TOO BIG
without this patch, but now properly displays the output with this patch.
v2: Remove updating *pos in seq start function.
[ Thanks to Li Zefan for pointing that out ]
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
* 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
do_coredump: Do not take BKL
init: Remove the BKL from startup code
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
workqueue: mark init_workqueues() as early_initcall()
workqueue: explain for_each_*cwq_cpu() iterators
fscache: fix build on !CONFIG_SYSCTL
slow-work: kill it
gfs2: use workqueue instead of slow-work
drm: use workqueue instead of slow-work
cifs: use workqueue instead of slow-work
fscache: drop references to slow-work
fscache: convert operation to use workqueue instead of slow-work
fscache: convert object to use workqueue instead of slow-work
workqueue: fix how cpu number is stored in work->data
workqueue: fix mayday_mask handling on UP
workqueue: fix build problem on !CONFIG_SMP
workqueue: fix locking in retry path of maybe_create_worker()
async: use workqueue for worker pool
workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
workqueue: implement unbound workqueue
workqueue: prepare for WQ_UNBOUND implementation
libata: take advantage of cmwq and remove concurrency limitations
workqueue: fix worker management invocation without pending works
...
Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
The blktrace driver currently needs the BKL, but
we should not need to take that in the block layer,
so just push it down into the driver itself.
It is quite likely that the BKL is not actually
required in blktrace code and could be removed
in a follow-on patch.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests. This allows much easier grepping for different request
types instead of unwinding through macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
um: Fix read_persistent_clock fallout
kgdb: Do not access xtime directly
powerpc: Clean up obsolete code relating to decrementer and timebase
powerpc: Rework VDSO gettimeofday to prevent time going backwards
clocksource: Add __clocksource_updatefreq_hz/khz methods
x86: Convert common clocksources to use clocksource_register_hz/khz
timekeeping: Make xtime and wall_to_monotonic static
hrtimer: Cleanup direct access to wall_to_monotonic
um: Convert to use read_persistent_clock
timkeeping: Fix update_vsyscall to provide wall_to_monotonic offset
powerpc: Cleanup xtime usage
powerpc: Simplify update_vsyscall
time: Kill off CONFIG_GENERIC_TIME
time: Implement timespec_add
x86: Fix vtime/file timestamp inconsistencies
Trivial conflicts in Documentation/feature-removal-schedule.txt
Much less trivial conflicts in arch/powerpc/kernel/time.c resolved as
per Thomas' earlier merge commit 47916be4e2 ("Merge branch
'powerpc.cherry-picks' into timers/clocksource")
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
sched: No need for bootmem special cases
sched: Revert nohz_ratelimit() for now
sched: Reduce update_group_power() calls
sched: Update rq->clock for nohz balanced cpus
sched: Fix spelling of sibling
sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
sched: thread_group_cputime: Simplify, document the "alive" check
sched: Remove the obsolete exit_state/signal hacks
sched: task_tick_rt: Remove the obsolete ->signal != NULL check
sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
sched: Fix comments to make them DocBook happy
sched: Fix fix_small_capacity
powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
powerpc: Enable asymmetric SMT scheduling on POWER7
sched: Add asymmetric group packing option for sibling domain
sched: Fix capacity calculations for SMT4
sched: Change nohz idle load balancing logic to push model
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits)
tracing/kprobes: unregister_trace_probe needs to be called under mutex
perf: expose event__process function
perf events: Fix mmap offset determination
perf, powerpc: fsl_emb: Restore setting perf_sample_data.period
perf, powerpc: Convert the FSL driver to use local64_t
perf tools: Don't keep unreferenced maps when unmaps are detected
perf session: Invalidate last_match when removing threads from rb_tree
perf session: Free the ref_reloc_sym memory at the right place
x86,mmiotrace: Add support for tracing STOS instruction
perf, sched migration: Librarize task states and event headers helpers
perf, sched migration: Librarize the GUI class
perf, sched migration: Make the GUI class client agnostic
perf, sched migration: Make it vertically scrollable
perf, sched migration: Parameterize cpu height and spacing
perf, sched migration: Fix key bindings
perf, sched migration: Ignore unhandled task states
perf, sched migration: Handle ignored migrate out events
perf: New migration tool overview
tracing: Drop cpparg() macro
perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call
...
Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c
With CONFIG_DEBUG_PAGEALLOC, I observed an unallocated memory access in
function_graph trace. It appears we find a small size entry in ring buffer,
but we access it as a big size entry. The access overflows the page size
and touches an unallocated page.
Cc: <stable@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <1280217994.32400.76.camel@sli10-desk.sh.intel.com>
[ Added a comment to explain the problem - SDR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In systems with more than one processor it is desirable to look at the
per cpu trace buffers.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Add in a helper function to allow the kdb shell to dump the ftrace
buffer.
Modify trace.c to expose the capability to iterate over the ftrace
buffer in a read only capacity.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Comment in unregister_trace_probe() says probe_lock will be held when it
gets called. However there is a case where it might called without the
probe_lock being held. Also since we are traversing the probe_list and
deleting an element from the probe_list, probe_lock should be held.
This was first pointed in uprobes traceevent review by Frederic
Weisbecker here. (http://lkml.org/lkml/2010/5/12/106)
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100630084548.GA10325@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We use synchronize_sched() to ensure a tracepoint won't be called
while/after we release the perf buffers it references.
But the tracepoint API has its own API for that:
tracepoint_synchronize_unregister(). Use it instead as it's
self-explanatory and eases maintainance.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Now that all arches have been converted over to use generic time via
clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME
config option and simplify the generic code.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We need to add one to the strlen() return because of the NULL
character. The type->name here generally comes from the kernel and I
don't think any of them come close to being MAX_TRACER_SIZE (100)
characters long so this is basically a cleanup.
Signed-off-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <20100710100644.GV19184@bicker>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Documentation/trace/ftrace.txt says
buffer_size_kb:
This sets or displays the number of kilobytes each CPU
buffer can hold. The tracer buffers are the same size
for each CPU. The displayed number is the size of the
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
that the kernel uses for allocation, usually 4 KB in size).
If the last page allocated has room for more bytes
than requested, the rest of the page will be used,
making the actual allocation bigger than requested.
( Note, the size may not be a multiple of the page size
due to buffer management overhead. )
This can only be updated when the current_tracer
is set to "nop".
But it's incorrect. currently total memory consumption is
'buffer_size_kb x CPUs x 2'.
Why two times difference is there? because ftrace implicitly allocate
the buffer for max latency too.
That makes sad result when admin want to use large buffer. (If admin
want full logging and makes detail analysis). example, If admin
have 24 CPUs machine and write 200MB to buffer_size_kb, the system
consume ~10GB memory (200MB x 24 x 2). umm.. 5GB memory waste is
usually unacceptable.
Fortunatelly, almost all users don't use max latency feature.
The max latency buffer can be disabled easily.
This patch shrink buffer size of the max latency buffer if
unnecessary.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__print_flags() and __print_symbolic() use percpu trace_seq:
1) Its memory is allocated at compile time, it wastes memory if we don't use tracing.
2) It is percpu data and it wastes more memory for multi-cpus system.
3) It disables preemption when it executes its core routine
"trace_seq_printf(s, "%s: ", #call);" and introduces latency.
So we move this trace_seq to struct trace_iterator.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4C078350.7090106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reorder structure to remove 8 bytes of padding on 64 bit builds.
This shrinks the size to 128 bytes so allowing allocation from a smaller
slab & needed one fewer cache lines.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
LKML-Reference: <1269516456.2054.8.camel@localhost>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We found that even enabling a single trace event that will rarely be
triggered can add big overhead to context switch.
(lmbench context switch test)
-------------------------------------------------
2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
------ ------ ------ ------ ------ ------- -------
2.19 2.3 2.21 2.56 2.13 2.54 2.07
2.39 2.51 2.35 2.75 2.27 2.81 2.24
The overhead is 6% ~ 11%.
It's because when a trace event is enabled 3 tracepoints (sched_switch,
sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname.
We'd like to avoid this overhead, so add a trace option '(no)record-cmd'
to allow to disable cmdline recording.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The default for llseek will change to no_llseek,
so the tracing debugfs files need to add explicit
.llseek assignments. Since we're dealing with regular
files from a VFS perspective, use generic_file_llseek.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1278538820-1392-10-git-send-email-arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Special traces type was only used by sysprof. Lets remove it now
that sysprof ftrace plugin has been dropped.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
The sysprof ftrace plugin doesn't seem to be seriously used
somewhere. There is a branch in the sysprof tree that makes
an interface to it, but the real sysprof tool uses either its
own module or perf events.
Drop the sysprof ftrace plugin then, as it's mostly useless.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
The ksym (breakpoint) ftrace plugin has been superseded by perf
tools that are much more poweful to use the cpu breakpoints.
This tracer doesn't bring more feature. It has been deprecated
for a while now, lets remove it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
I have shown by code review that no driver takes
the BKL at init time any more, so whatever the
init code was locking against is no longer there
and it is now safe to remove the BKL there.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Steven Rostedt <rostedt@goodmis>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Support string type tracing and printing in kprobe-tracer.
This allows user to trace string data in kernel including __user data. Note
that sometimes __user data may not be accessed if it is paged-out (sorry, but
kprobes operation should be done in atomic, we can not wait for page-in).
Commiter note: Fixed up conflicts with b7e2ece.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Strip tracing code from workqueue and remove workqueue tracing. This
is temporary measure till concurrency managed workqueue is complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Because kprobes and syscalls need special processing to register
events, the class->reg() method was created to handle the differences.
But instead of creating a default ->reg for perf and ftrace events,
the code was scattered with:
if (class->reg)
class->reg();
else
default_reg();
This is messy and can also lead to bugs.
This patch cleans up this code and creates a default reg() entry for
the events allowing for the code to directly call the class->reg()
without the condition.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The nsecs_str string is a local variable defined as:
char nsecs_str[5];
It is possible for the snprintf call to use a size value larger than the
size of the string. This should not cause a buffer overrun as it is
written now due to the value for the string format "%03lu" can not be
larger than 1000. However, this change makes it correct. By making the
size correct we guard against potential future changes that could actually
cause a buffer overrun.
Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1276619355-18116-1-git-send-email-chase.douglas@canonical.com>
[ added 'UL' to number 8 to fix gcc warning comparing it to sizeof() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Let trace_module_add_events() and event_trace_init() call
__trace_add_event_call().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA37E9.1020106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
raw_init callback is optional.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA37D4.7070500@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every event (or event class) has it's define_fields callback,
so the test is redundant.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA37BC.8080707@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every event has the same common fields, so it's a big waste of
memory to have a copy of those fields for every event.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA3759.30105@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All syscall exit events have the same fields.
The kernel size drops 2.5K:
text data bss dec hex filename
7018612 2034376 7251132 16304120 f8c7f8 vmlinux.o.orig
7018612 2031888 7251132 16301632 f8be40 vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA3746.8070100@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the addition of the code to shrink the kernel tracepoint
infrastructure, we lost kprobes being traced by perf. The reason
is that I tested if the "tp_event->class->perf_probe" existed before
enabling it. This prevents "ftrace only" events (like the function
trace events) from being enabled by perf.
Unfortunately, kprobe events do not use perf_probe. This causes
kprobes to be missed by perf. To fix this, we add the test to
see if "tp_event->class->reg" exists as well as perf_probe.
Normal trace events have only "perf_probe" but no "reg" function,
and kprobes and syscalls have the "reg" but no "perf_probe".
The ftrace unique events do not have either, so this is a valid
test. If a kprobe or syscall is not to be probed by perf, the
"reg" function is called anyway, and will return a failure and
prevent perf from probing it.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We have been resisting new ftrace plugins and removing existing
ones, and kmemtrace has been superseded by kmem trace events
and perf-kmem, so we remove it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
[ remove kmemtrace from the makefile, handle slob too ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
For people who otherwise get to write: cpu_clock(smp_processor_id()),
there is now: local_clock().
Also, as per suggestion from Andrew, provide some documentation on
the various clock interfaces, and minimize the unsigned long long vs
u64 mess.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
LKML-Reference: <1275052414.1645.52.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The boot tracer is useless. It simply logs the initcalls
but in fact these initcalls are also logged through printk
while using the initcall_debug kernel parameter.
Nobody seem to be using it so far. Then just remove it.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Chase Douglas <chase.douglas@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <20100526105753.GA5677@cr0.nay.redhat.com>
[ remove the hooks in main.c, and the headers ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Drop this argument now that we always want to rewind only to the
state of the first caller.
It means frame pointers are not necessary anymore to reliably get
the source of an event. But this also means we need this helper
to be a macro now, as an inline function is not an option since
we need to know when to provide a default implentation.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
arch/x86/include/asm/stacktrace.h and arch/x86/kernel/dumpstack.h
declare headers of objects that deal with the same topic.
Actually most of the files that include stacktrace.h also include
dumpstack.h
Although dumpstack.h seems more reserved for internals of stack
traces, those are quite often needed to define specialized stack
trace operations. And perf event arch headers are going to need
access to such low level operations anyway. So don't continue to
bother with dumpstack.h as it's not anymore about isolated deep
internals.
v2: fix struct stack_frame definition conflict in sysprof
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Soeren Sandmann <sandmann@daimi.au.dk>
The ftrace_preempt_disable/enable functions were to address a
recursive race caused by the function tracer. The function tracer
traces all functions which makes it easily susceptible to recursion.
One area was preempt_enable(). This would call the scheduler and
the schedulre would call the function tracer and loop.
(So was it thought).
The ftrace_preempt_disable/enable was made to protect against recursion
inside the scheduler by storing the NEED_RESCHED flag. If it was
set before the ftrace_preempt_disable() it would not call schedule
on ftrace_preempt_enable(), thinking that if it was set before then
it would have already scheduled unless it was already in the scheduler.
This worked fine except in the case of SMP, where another task would set
the NEED_RESCHED flag for a task on another CPU, and then kick off an
IPI to trigger it. This could cause the NEED_RESCHED to be saved at
ftrace_preempt_disable() but the IPI to arrive in the the preempt
disabled section. The ftrace_preempt_enable() would not call the scheduler
because the flag was already set before entring the section.
This bug would cause a missed preemption check and cause lower latencies.
Investigating further, I found that the recusion caused by the function
tracer was not due to schedule(), but due to preempt_schedule(). Now
that preempt_schedule is completely annotated with notrace, the recusion
no longer is an issue.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix blktrace.c kernel-doc warnings:
Warning(kernel/trace/blktrace.c:858): No description found for parameter 'ignore'
Warning(kernel/trace/blktrace.c:890): No description found for parameter 'ignore'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100529114507.c466fc1e.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steve spotted I forgot to do the destroy under event_mutex.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1274451913.1674.1707.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tracepoint_probe_unregister() does not synchronize against the probe
callbacks, so do that explicitly. This properly serializes the callbacks
and the free of the data used therein.
Also, use this_cpu_ptr() where possible.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1274438476.1674.1702.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tui: Fix last use_browser problem related to .perfconfig
perf symbols: Add the build id cache to the vmlinux path
perf tui: Reset use_browser if stdout is not a tty
ring-buffer: Move zeroing out excess in page to ring buffer code
ring-buffer: Reset "real_end" when page is filled
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits)
tracing: Add __used annotation to event variable
perf, trace: Fix !x86 build bug
perf report: Support multiple events on the TUI
perf annotate: Fix up usage of the build id cache
x86/mmiotrace: Remove redundant instruction prefix checks
perf annotate: Add TUI interface
perf tui: Remove annotate from popup menu after failure
perf report: Don't start the TUI if -D is used
perf: Fix getline undeclared
perf: Optimize perf_tp_event_match()
perf: Remove more code from the fastpath
perf: Optimize the !vmalloc backed buffer
perf: Optimize perf_output_copy()
perf: Fix wakeup storm for RO mmap()s
perf-record: Share per-cpu buffers
perf-record: Remove -M
perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers
perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events
perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction
perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig
...
Currently the trace splice code zeros out the excess bytes in the page before
sending it off to userspace.
This is to make sure userspace is not getting anything it should not be
when reading the pages, because the excess data was never initialized
to zero before writing (for perfomance reasons).
But the splice code has no business in doing this work, it should be
done by the ring buffer. With the latest changes for recording lost
events, the splice code gets it wrong anyway.
Move the zeroing out of excess bytes into the ring buffer code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code to store the "lost events" requires knowing the real end
of the page. Since the 'commit' includes the padding at the end of
a page a "real_end" variable was used to keep track of the end not
including the padding.
If events were lost, the reader can place the count of events in
the padded area if there is enough room.
The bug this patch fixes is that when we fill the page we do not
reset the real_end variable, and if the writer had wrapped a few
times, the real_end would be incorrect.
This patch simply resets the real_end if the page was filled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Patch b7e2ecef92 (perf, trace: Optimize tracepoints by removing
IRQ-disable from perf/tracepoint interaction) made the
unfortunate mistake of assuming the world is x86 only, correct
this.
The problem was that perf_fetch_caller_regs() did
local_save_flags() into regs->flags, and I re-used that to
remove another local_save_flags(), forgetting !x86 doesn't have
regs->flags.
Do the reverse, remove the local_save_flags() from
perf_fetch_caller_regs() and let the ftrace site do the
local_save_flags() instead.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: acme@redhat.com
Cc: efault@gmx.de
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1274778175.5882.623.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (182 commits)
[SCSI] aacraid: add an ifdef'd device delete case instead of taking the device offline
[SCSI] aacraid: prohibit access to array container space
[SCSI] aacraid: add support for handling ATA pass-through commands.
[SCSI] aacraid: expose physical devices for models with newer firmware
[SCSI] aacraid: respond automatically to volumes added by config tool
[SCSI] fcoe: fix fcoe module ref counting
[SCSI] libfcoe: FIP Keep-Alive messages for VPorts are sent with incorrect port_id and wwn
[SCSI] libfcoe: Fix incorrect MAC address clearing
[SCSI] fcoe: fix a circular locking issue with rtnl and sysfs mutex
[SCSI] libfc: Move the port_id into lport
[SCSI] fcoe: move link speed checking into its own routine
[SCSI] libfc: Remove extra pointer check
[SCSI] libfc: Remove unused fc_get_host_port_type
[SCSI] fcoe: fixes wrong error exit in fcoe_create
[SCSI] libfc: set seq_id for incoming sequence
[SCSI] qla2xxx: Updates to ISP82xx support.
[SCSI] qla2xxx: Optionally disable target reset.
[SCSI] qla2xxx: ensure flash operation and host reset via sg_reset are mutually exclusive
[SCSI] qla2xxx: Silence bogus warning by gcc for wrap and did.
[SCSI] qla2xxx: T10 DIF support added.
...
Avoid the swevent hash-table by using per-tracepoint
hlists.
Also, avoid conditionals on the fast path by ordering
with probe unregister so that we should never get on
the callback path without the data being there.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100521090710.473188012@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we've got but a single event per tracepoint
there is no reason to try and multiplex it so don't.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix "integer as NULL pointer" warning.
tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header
tracing: Make the documentation clear on trace_event boot option
ring-buffer: Wrap open-coded WARN_ONCE
tracing: Convert nop macros to static inlines
tracing: Fix sleep time function profiling
tracing: Show sample std dev in function profiling
tracing: Add documentation for trace commands mod, traceon/traceoff
ring-buffer: Make benchmark handle missed events
ring-buffer: Make non-consuming read less expensive with lots of cpus.
tracing: Add graph output support for irqsoff tracer
tracing: Have graph flags passed in to ouput functions
tracing: Add ftrace events for graph tracer
tracing: Dump either the oops's cpu source or all cpus buffers
tracing: Fix uninitialized variable of tracing/trace output
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
sched, wait: Use wrapper functions
sched: Remove a stale comment
ondemand: Make the iowait-is-busy time a sysfs tunable
ondemand: Solve a big performance issue by counting IOWAIT time as busy
sched: Intoduce get_cpu_iowait_time_us()
sched: Eliminate the ts->idle_lastupdate field
sched: Fold updating of the last_update_time_info into update_ts_time_stats()
sched: Update the idle statistics in get_cpu_idle_time_us()
sched: Introduce a function to update the idle statistics
sched: Add a comment to get_cpu_idle_time_us()
cpu_stop: add dummy implementation for UP
sched: Remove rq argument to the tracepoints
rcu: need barrier() in UP synchronize_sched_expedited()
sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
sched: kill paranoia check in synchronize_sched_expedited()
sched: replace migration_thread with cpu_stop
stop_machine: reimplement using cpu_stop
cpu_stop: implement stop_cpu[s]()
sched: Fix select_idle_sibling() logic in select_task_rq_fair()
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
perf tools: Add mode to build without newt support
perf symbols: symbol inconsistency message should be done only at verbose=1
perf tui: Add explicit -lslang option
perf options: Type check all the remaining OPT_ variants
perf options: Type check OPT_BOOLEAN and fix the offenders
perf options: Check v type in OPT_U?INTEGER
perf options: Introduce OPT_UINTEGER
perf tui: Add workaround for slang < 2.1.4
perf record: Fix bug mismatch with -c option definition
perf options: Introduce OPT_U64
perf tui: Add help window to show key associations
perf tui: Make <- exit menus too
perf newt: Add single key shortcuts for zoom into DSO and threads
perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
perf newt: Fix the 'A'/'a' shortcut for annotate
perf newt: Make <- exit the ui_browser
x86, perf: P4 PMU - fix counters management logic
perf newt: Make <- zoom out filters
perf report: Report number of events, not samples
perf hist: Clarify events_stats fields usage
...
Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
ftrace_trace_stack() and frace_trace_userstacke() take a
struct ring_buffer argument, not struct trace_array. Commit
e77405ad("tracing: pass around ring buffer instead of tracer")
made this change.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BE77C14.5010806@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The filter_active and enable both use an int (4 bytes each) to
set a single flag. We can save 4 bytes per event by combining the
two into a single integer.
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4894944 1018052 861512 6774508 675eec vmlinux.id
4894871 1012292 861512 6768675 674823 vmlinux.flags
This gives us another 5K in savings.
The modification of both the enable and filter fields are done
under the event_mutex, so it is still safe to combine the two.
Note: Although Mathieu gave his Acked-by, he would like it documented
that the reads of flags are not protected by the mutex. The way the
code works, these reads will not break anything, but will have a
residual effect. Since this behavior is the same even before this
patch, describing this situation is left to another patch, as this
patch does not change the behavior, but just brought it to Mathieu's
attention.
v2: Updated the event trace self test to for this change.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that the trace_event structure is embedded in the ftrace_event_call
structure, there is no need for the ftrace_event_call id field.
The id field is the same as the trace_event type field.
Removing the id and re-arranging the structure brings down the tracepoint
footprint by another 5K.
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4895024 1023812 861512 6780348 6775bc vmlinux.print
4894944 1018052 861512 6774508 675eec vmlinux.id
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, every event has its own trace_event structure. This is
fine since the structure is needed anyway. But the print function
structure (trace_event_functions) is now separate. Since the output
of the trace event is done by the class (with the exception of events
defined by DEFINE_EVENT_PRINT), it makes sense to have the class
define the print functions that all events in the class can use.
This makes a bigger deal with the syscall events since all syscall events
use the same class. The savings here is another 30K.
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4900382 1048964 861512 6810858 67ecea vmlinux.init
4900446 1049028 861512 6810986 67ed6a vmlinux.preprint
4895024 1023812 861512 6780348 6775bc vmlinux.print
To accomplish this, and to let the class know what event is being
printed, the event structure is embedded in the ftrace_event_call
structure. This should not be an issues since the event structure
was created for each event anyway.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Multiple events may use the same method to print their data.
Instead of having all events have a pointer to their print funtions,
the trace_event structure now points to a trace_event_functions structure
that will hold the way to print ouf the event.
The event itself is now passed to the print function to let the print
function know what kind of event it should print.
This opens the door to consolidating the way several events print
their output.
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4900382 1048964 861512 6810858 67ecea vmlinux.init
4900446 1049028 861512 6810986 67ed6a vmlinux.preprint
This change slightly increases the size but is needed for the next change.
v3: Fix the branch tracer events to handle this change.
v2: Fix the new function graph tracer event calls to handle this change.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The raw_init function pointer in the event is used to initialize
various kinds of events. The type of initialization needed is usually
classed to the kind of event it is.
Two events with the same class will always have the same initialization
function, so it makes sense to move this to the class structure.
Perhaps even making a special system structure would work since
the initialization is the same for all events within a system.
But since there's no system structure (yet), this will just move it
to the class.
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4900375 1053380 861512 6815267 67fe23 vmlinux.fields
4900382 1048964 861512 6810858 67ecea vmlinux.init
The text grew very slightly, but this is a constant growth that happened
with the changing of the C files that call the init code.
The bigger savings is the data which will be saved the more events share
a class.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the defined fields from the event to the class structure.
Since the fields of the event are defined by the class they belong
to, it makes sense to have the class hold the information instead
of the individual events. The events of the same class would just
hold duplicate information.
After this change the size of the kernel dropped another 3K:
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4900252 1057412 861512 6819176 680d68 vmlinux.regs
4900375 1053380 861512 6815267 67fe23 vmlinux.fields
Although the text increased, this was mainly due to the C files
having to adapt to the change. This is a constant increase, where
new tracepoints will not increase the Text. But the big drop is
in the data size (as well as needed allocations to hold the fields).
This will give even more savings as more tracepoints are created.
Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS()
with several DEFINE_EVENT()s, then the savings will be lost. But
we are pushing developers to consolidate events with DEFINE_EVENT()
so this should not be an issue.
The kprobes define a unique class to every new event, but are dynamic
so it should not be a issue.
The syscalls however have a single class but the fields for the individual
events are different. The syscalls use a metadata to define the
fields. I moved the fields list from the event to the metadata and
added a "get_fields()" function to the class. This function is used
to find the fields. For normal events and kprobes, get_fields() just
returns a pointer to the fields list_head in the class. For syscall
events, it returns the fields list_head in the metadata for the event.
v2: Fixed the syscall fields. The syscall metadata needs a list
of fields for both enter and exit.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch removes the register functions of TRACE_EVENT() to enable
and disable tracepoints. The registering of a event is now down
directly in the trace_events.c file. The tracepoint_probe_register()
is now called directly.
The prototypes are no longer type checked, but this should not be
an issue since the tracepoints are created automatically by the
macros. If a prototype is incorrect in the TRACE_EVENT() macro, then
other macros will catch it.
The trace_event_class structure now holds the probes to be called
by the callbacks. This removes needing to have each event have
a separate pointer for the probe.
To handle kprobes and syscalls, since they register probes in a
different manner, a "reg" field is added to the ftrace_event_class
structure. If the "reg" field is assigned, then it will be called for
enabling and disabling of the probe for either ftrace or perf. To let
the reg function know what is happening, a new enum (trace_reg) is
created that has the type of control that is needed.
With this new rework, the 82 kernel events and 618 syscall events
has their footprint dramatically lowered:
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4914025 1088868 861512 6864405 68be15 vmlinux.class
4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint
4900252 1057412 861512 6819176 680d68 vmlinux.regs
The size went from 6863829 to 6819176, that's a total of 44K
in savings. With tracepoints being continuously added, this is
critical that the footprint becomes minimal.
v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf
specific structure in trace_events.c.
v4: Fixed trace self tests to check probe because regfunc no longer
exists.
v3: Updated to handle void *data in beginning of probe parameters.
Also added the tracepoint: check_trace_callback_type_##call().
v2: Changed the callback probes to pass void * and typecast the
value within the function.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds data to be passed to tracepoint callbacks.
The created functions from DECLARE_TRACE() now need a mandatory data
parameter. For example:
DECLARE_TRACE(mytracepoint, int value, value)
Will create the register function:
int register_trace_mytracepoint((void(*)(void *data, int value))probe,
void *data);
As the first argument, all callbacks (probes) must take a (void *data)
parameter. So a callback for the above tracepoint will look like:
void myprobe(void *data, int value)
{
}
The callback may choose to ignore the data parameter.
This change allows callbacks to register a private data pointer along
with the function probe.
void mycallback(void *data, int value);
register_trace_mytracepoint(mycallback, mydata);
Then the mycallback() will receive the "mydata" as the first parameter
before the args.
A more detailed example:
DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));
/* In the C file */
DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));
[...]
trace_mytracepoint(status);
/* In a file registering this tracepoint */
int my_callback(void *data, int status)
{
struct my_struct my_data = data;
[...]
}
[...]
my_data = kmalloc(sizeof(*my_data), GFP_KERNEL);
init_my_data(my_data);
register_trace_mytracepoint(my_callback, my_data);
The same callback can also be registered to the same tracepoint as long
as the data registered is different. Note, the data must also be used
to unregister the callback:
unregister_trace_mytracepoint(my_callback, my_data);
Because of the data parameter, tracepoints declared this way can not have
no args. That is:
DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS());
will cause an error.
If no arguments are needed, a new macro can be used instead:
DECLARE_TRACE_NOARGS(mytracepoint);
Since there are no arguments, the proto and args fields are left out.
This is part of a series to make the tracepoint footprint smaller:
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4914025 1088868 861512 6864405 68be15 vmlinux.class
4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint
Again, this patch also increases the size of the kernel, but
lays the ground work for decreasing it.
v5: Fixed net/core/drop_monitor.c to handle these updates.
v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the
#ifdef CONFIG_TRACE_POINTS, since the two are the same in both
cases. The __DECLARE_TRACE() is what changes.
Thanks to Frederic Weisbecker for pointing this out.
v3: Made all register_* functions require data to be passed and
all callbacks to take a void * parameter as its first argument.
This makes the calling functions comply with C standards.
Also added more comments to the modifications of DECLARE_TRACE().
v2: Made the DECLARE_TRACE() have the ability to pass arguments
and added a new DECLARE_TRACE_NOARGS() for tracepoints that
do not need any arguments.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch creates a ftrace_event_class struct that event structs point to.
This class struct will be made to hold information to modify the
events. Currently the class struct only holds the events system name.
This patch slightly increases the size, but this change lays the ground work
of other changes to make the footprint of tracepoints smaller.
With 82 standard tracepoints, and 618 system call tracepoints
(two tracepoints per syscall: enter and exit):
text data bss dec hex filename
4913961 1088356 861512 6863829 68bbd5 vmlinux.orig
4914025 1088868 861512 6864405 68be15 vmlinux.class
This patch also cleans up some stale comments in ftrace.h.
v2: Fixed missing semi-colon in macro.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
struct rq isn't visible outside of sched.o so its near useless to
expose the pointer, also there are no users of it, so remove it.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1272997616.1642.207.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original code doesn't work because "call" is never NULL there.
Signed-off-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <20100320143911.GF5331@bicker>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The breakpoint generic layer assumes that archs always know in advance
the static number of address registers available to host breakpoints
through the HBP_NUM macro.
However this is not true for every archs. For example Arm needs to get
this information dynamically to handle the compatiblity between
different versions.
To solve this, this patch proposes to drop the static HBP_NUM macro
and let the arch provide the number of available slots through a
new hw_breakpoint_slots() function. For archs that have
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS selected, it will be called once
as the number of registers fits for instruction and data breakpoints
together.
For the others it will be called first to get the number of
instruction breakpoint registers and another time to get the
data breakpoint registers, the targeted type is given as a
parameter of hw_breakpoint_slots().
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
__print_hex() prints values in an array in hex (w/o '0x') (space separated)
EX) 92 33 32 f3 ee 4d
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
When sleep_time is off the function profiler ignores the time that a task
is scheduled out. When the task is scheduled out a timestamp is taken.
When the task is scheduled back in, the timestamp is compared to the
current time and the saved calltimes are adjusted accordingly.
But when stopping the function profiler, the sched switch hook that
does this adjustment was stopped before shutting down the tracer.
This allowed some tasks to not get their timestamps set when they
scheduled out. When the function profiler started again, this would
skew the times of the scheduler functions.
This patch moves the stopping of the sched switch to after the function
profiler is stopped. It also ignores zero set calltimes, which may
happen on start up.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When combined with function graph tracing the ftrace function profiler
also prints the average run time of functions. While this gives us some
good information, it doesn't tell us anything about the variance of the
run times of the function. This change prints out the s^2 sample
standard deviation alongside the average.
This change adds one entry to the profile record structure. This
increases the memory footprint of the function profiler by 1/3 on a
32-bit system, and by 1/5 on a 64-bit system when function graphing is
enabled, though the memory is only allocated when the profiler is turned
on. During the profiling, one extra line of code adds the squared
calltime to the new record entry, so this should not adversly affect
performance.
Note that the square of the sample standard deviation is printed because
there is no sqrt implementation for unsigned long long in the kernel.
Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1272304925-2436-1-git-send-email-chase.douglas@canonical.com>
[ fixed comment about ns^2 -> us^2 conversion ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the addition of the "missed events" flags that is stored in the
commit field of the ring buffer page, the ring_buffer_benchmark
was not updated to handle this. If events are missed, then the
missed events flag is set in the ring buffer page, the benchmark
will count that flag as part of the size of the page and will hit the BUG()
when it tries to read beyond the page.
The solution is simply to have the ring buffer benchmark mask off
the extra bits.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When performing a non-consuming read, a synchronize_sched() is
performed once for every cpu which is actively tracing.
This is very expensive, and can make it take several seconds to open
up the 'trace' file with lots of cpus.
Only one synchronize_sched() call is actually necessary. What is
desired is for all cpus to see the disabling state change. So we
transform the existing sequence:
for_each_cpu() {
ring_buffer_read_start();
}
where each ring_buffer_start() call performs a synchronize_sched(),
into the following:
for_each_cpu() {
ring_buffer_read_prepare();
}
ring_buffer_read_prepare_sync();
for_each_cpu() {
ring_buffer_read_start();
}
wherein only the single ring_buffer_read_prepare_sync() call needs to
do the synchronize_sched().
The first phase, via ring_buffer_read_prepare(), allocates the 'iter'
memory and increments ->record_disabled.
In the second phase, ring_buffer_read_prepare_sync() makes sure this
->record_disabled state is visible fully to all cpus.
And in the final third phase, the ring_buffer_read_start() calls reset
the 'iter' objects allocated in the first phase since we now know that
none of the cpus are adding trace entries any more.
This makes openning the 'trace' file nearly instantaneous on a
sparc64 Niagara2 box with 128 cpus tracing.
Signed-off-by: David S. Miller <davem@davemloft.net>
LKML-Reference: <20100420.154711.11246950.davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add function graph output to irqsoff tracer.
The graph output is enabled by setting new 'display-graph' trace option.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1270227683-14631-4-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Let the function graph tracer have custom flags passed to its
output functions.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1270227683-14631-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add ftrace events for graph tracer, so the graph output could be shared
with other tracers.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1270227683-14631-2-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_dump_on_oops kernel parameter, sysctl and sysrq let one
dump every cpu buffers when an oops or panic happens.
It's nice when you have few cpus but it may take ages if have many,
plus you miss the real origin of the problem in all the cpu traces.
Sometimes, all you need is to dump the cpu buffer that triggered the
opps, most of the time it is our main interest.
This patch modifies ftrace_dump_on_oops to handle this choice.
The ftrace_dump_on_oops kernel parameter, when it comes alone, has
the same behaviour than before. But ftrace_dump_on_oops=orig_cpu
will only dump the buffer of the cpu that oops'ed.
Similarly, sysctl kernel.ftrace_dump_on_oops=1 and
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops keep their previous
behaviour. But setting 2 jumps into cpu origin dump mode.
v2: Fix double setup
v3: Fix spelling issues reported by Randy Dunlap
v4: Also update __ftrace_dump in the selftests
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Support basic types of integer (u8, u16, u32, u64, s8, s16, s32, s64) in
kprobe tracer. With this patch, users can specify above basic types on
each arguments after ':'. If omitted, the argument type is set as
unsigned long (u32 or u64, arch-dependent).
e.g.
echo 'p account_system_time+0 hardirq_offset=%si:s32' > kprobe_events
adds a probe recording hardirq_offset in signed-32bits value on the
entry of account_system_time.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100412171708.3790.18599.stgit@localhost6.localdomain6>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Because a local variable is not initialized, I got these
when I did 'cat tracing/trace'. (not trace_pipe):
CPU:0 [LOST 18446744071579453134 EVENTS]
ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446744071579453134 EVENTS]
ps-3099 [000] 560.770221: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446612133255294080 EVENTS]
ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446744071579453134 EVENTS]
ps-3099 [000] 560.770222: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446744071579453134 EVENTS]
ps-3099 [000] 560.770222: lock_release: ffffffff816cfb98 dcache_lock
See peek_next_entry(), it does not set *lost_events when we 'cat tracing/trace'
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4BB9A929.2000303@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Always build the powerpc perf_arch_fetch_caller_regs version
perf: Always build the stub perf_arch_fetch_caller_regs version
perf, probe-finder: Build fix on Debian
perf/scripts: Tuple was set from long in both branches in python_process_event()
perf: Fix 'perf sched record' deadlock
perf, x86: Fix callgraphs of 32-bit processes on 64-bit kernels
perf, x86: Fix AMD hotplug & constraint initialization
x86: Move notify_cpu_starting() callback to a later stage
x86,kgdb: Always initialize the hw breakpoint attribute
perf: Use hot regs with software sched switch/migrate events
perf: Correctly align perf event tracing buffer
The trace event buffer used by perf to record raw sample events
is typed as an array of char and may then not be aligned to 8
by alloc_percpu().
But we need it to be aligned to 8 in sparc64 because we cast
this buffer into a random structure type built by the TRACE_EVENT()
macro to store the traces. So if a random 64 bits field is accessed
inside, it may be not under an expected good alignment.
Use an array of long instead to force the appropriate alignment, and
perform a compile time check to ensure the size in byte of the buffer
is a multiple of sizeof(long) so that its actual size doesn't get
shrinked under us.
This fixes unaligned accesses reported while using perf lock
in sparc 64.
Suggested-by: David Miller <davem@davemloft.net>
Suggested-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Currently, binary readers of the ring buffer only know where events were
lost, but not how many events were lost at that location.
This information is available, but it would require adding another
field to the sub buffer header to include it.
But when a event can not fit at the end of a sub buffer, it is written
to the next sub buffer. This means there is a good chance that the
buffer may have room to hold this counter. If it does, write
the counter at the end of the sub buffer and set another flag
in the data size field that states that this information exists.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that the ring buffer can keep track of where events are lost.
Use this information to the output of trace_pipe:
hackbench-3588 [001] 1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock
hackbench-3588 [001] 1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock
hackbench-3588 [001] 1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock
CPU:1 [LOST 673 EVENTS]
hackbench-3588 [001] 1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738
hackbench-3588 [001] 1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem
hackbench-3588 [001] 1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem
Even works with the function graph tracer:
2) ! 170.098 us | }
2) 4.036 us | rcu_irq_exit();
2) 3.657 us | idle_cpu();
2) ! 190.301 us | }
CPU:2 [LOST 2196 EVENTS]
2) 0.853 us | } /* cancel_dirty_page */
2) | remove_from_page_cache() {
2) 1.578 us | _raw_spin_lock_irq();
2) | __remove_from_page_cache() {
Note, it does not work with the iterator "trace" file, since it requires
the use of consuming the page from the ring buffer to determine how many
events were lost, which the iterator does not do.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, when the ring buffer drops events, it does not record
the fact that it did so. It does inform the writer that the event
was dropped by returning a NULL event, but it does not put in any
place holder where the event was dropped.
This is not a trivial thing to add because the ring buffer mostly
runs in overwrite (flight recorder) mode. That is, when the ring
buffer is full, new data will overwrite old data.
In a produce/consumer mode, where new data is simply dropped when
the ring buffer is full, it is trivial to add the placeholder
for dropped events. When there's more room to write new data, then
a special event can be added to notify the reader about the dropped
events.
But in overwrite mode, any new write can overwrite events. A place
holder can not be inserted into the ring buffer since there never
may be room. A reader could also come in at anytime and miss the
placeholder.
Luckily, the way the ring buffer works, the read side can find out
if events were lost or not, and how many events. Everytime a write
takes place, if it overwrites the header page (the next read) it
updates a "overrun" variable that keeps track of the number of
lost events. When a reader swaps out a page from the ring buffer,
it can record this number, perfom the swap, and then check to
see if the number changed, and take the diff if it has, which would be
the number of events dropped. This can be stored by the reader
and returned to callers of the reader.
Since the reader page swap will fail if the writer moved the head
page since the time the reader page set up the swap, this gives room
to record the overruns without worrying about races. If the reader
sets up the pages, records the overrun, than performs the swap,
if the swap succeeds, then the overrun variable has not been
updated since the setup before the swap.
For binary readers of the ring buffer, a flag is set in the header
of each sub page (sub buffer) of the ring buffer. This flag is embedded
in the size field of the data on the sub buffer, in the 31st bit (the size
can be 32 or 64 bits depending on the architecture), but only 27
bits needs to be used for the actual size (less actually).
We could add a new field in the sub buffer header to also record the
number of events dropped since the last read, but this will change the
format of the binary ring buffer a bit too much. Perhaps this change can
be made if the information on the number of events dropped is considered
important enough.
Note, the notification of dropped events is only used by consuming reads
or peeking at the ring buffer. Iterating over the ring buffer does not
keep this information because the necessary data is only available when
a page swap is made, and the iterator does not swap out pages.
Cc: Robert Richter <robert.richter@amd.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
In some error handling cases the lock is not unlocked. The return is
converted to a goto, to share the unlock at the end of the function.
A simplified version of the semantic patch that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
expression E1;
identifier f;
@@
f (...) { <+...
* spin_lock_irq (E1,...);
... when != E1
* return ...;
...+> }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
LKML-Reference: <Pine.LNX.4.64.1003291736440.21896@ask.diku.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
# echo 1 > events/enable
# echo global > trace_clock
------------[ cut here ]------------
WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190()
...
---[ end trace 3f86734a89416623 ]---
possible reason: unannotated irqs-on.
...
There's no reason to use the raw_local_irq_save() in trace_clock_global.
The local_irq_save() version is fine, and does not cause the bug in lockdep.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FA1.7030606@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align
Support for the PMU's BTS features has been upstreamed in
v2.6.32, but we still have the old and disabled ptrace-BTS,
as Linus noticed it not so long ago.
It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
regard for other uses (perf) and doesn't provide the flexibility
needed for perf either.
Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
was never used and ptrace-block-step can be implemented using a
much simpler approach.
So axe all 3000 lines of it. That includes the *locked_memory*()
APIs in mm/mlock.c as well.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Markus Metzger <markus.t.metzger@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100325135413.938004390@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The ring buffer uses 4 byte alignment while recording events into the
buffer, even on 64bit machines. This saves space when there are lots
of events being recorded at 4 byte boundaries.
The ring buffer has a zero copy method to write into the buffer, with
the reserving of space and then committing it. This may cause problems
when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and
PPC this is not an issue, but on some architectures this would cause an
out-of-alignment exception.
This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
if it is OK to use 4 byte alignments on 64 bit machines. If it is not,
it forces the ring buffer event header to be 8 bytes and not 4,
and will align the length of the data to be 8 byte aligned.
This keeps the data payload at 8 byte alignments and will allow these
machines to run without issue.
The trick to this is that the header can be either 4 bytes or 8 bytes
depending on the length of the data payload. The 4 byte header
has a length field that supports up to 112 bytes. If the length of
the data is more than 112, the length field is set to zero, and the actual
length is stored in the next 4 bytes after the header.
When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces
zero in the 4 byte header forcing the length to be stored in the 4 byte
array, even with a small data load. It also forces the length of the
data load to be 8 byte aligned. The combination of these two guarantee
that the data is always at 8 byte alignment.
Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
(on sparc64)
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
perf: Fix unexported generic perf_arch_fetch_caller_regs
perf record: Don't try to find buildids in a zero sized file
perf: export perf_trace_regs and perf_arch_fetch_caller_regs
perf, x86: Fix hw_perf_enable() event assignment
perf, ppc: Fix compile error due to new cpu notifiers
perf: Make the install relative to DESTDIR if specified
kprobes: Calculate the index correctly when freeing the out-of-line execution slot
perf tools: Fix sparse CPU numbering related bugs
perf_event: Fix oops triggered by cpu offline/online
perf: Drop the obsolete profile naming for trace events
perf: Take a hot regs snapshot for trace events
perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
perf/x86-64: Use frame pointer to walk on irq and process stacks
lockdep: Move lock events under lockdep recursion protection
perf report: Print the map table just after samples for which no map was found
perf report: Add multiple event support
perf session: Change perf_session post processing functions to take histogram tree
perf session: Add storage for seperating event types in report
perf session: Change add_hist_entry to take the tree root instead of session
perf record: Add ID and to recorded event data when recording multiple events
...
perf_arch_fetch_caller_regs() is exported for the overriden x86
version, but not for the generic weak version.
As a general rule, weak functions should not have their symbol
exported in the same file they are defined.
So let's export it on trace_event_perf.c as it is used by trace
events only.
This fixes:
ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!
-v2: And also only build it if trace events are enabled.
-v3: Fix changelog mistake
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
locking: Make sparse work with inline spinlocks and rwlocks
x86/mce: Fix RCU lockdep splats
rcu: Increase RCU CPU stall timeouts if PROVE_RCU
ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
rcu: Suppress RCU lockdep warnings during early boot
rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
rcu: Suppress __mpol_dup() false positive from RCU lockdep
rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
rcu: Add control variables to lockdep_rcu_dereference() diagnostics
rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
rcu: Use wrapper function instead of exporting tasklist_lock
sched, rcu: Fix rcu_dereference() for RCU-lockdep
rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
x86/gart: Unexport gart_iommu_aperture
Fix trivial conflicts in kernel/trace/ftrace.c
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Do not record user stack trace from NMI context
tracing: Disable buffer switching when starting or stopping trace
tracing: Use same local variable when resetting the ring buffer
function-graph: Init curr_ret_stack with ret_stack
ring-buffer: Move disabled check into preempt disable section
function-graph: Add tracing_thresh support to function_graph tracer
tracing: Update the comm field in the right variable in update_max_tr
function-graph: Use comment notation for func names of dangling '}'
function-graph: Fix unused reference to ftrace_set_func()
tracing: Fix warning in s_next of trace file ops
tracing: Include irqflags headers from trace clock
A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.
Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:
# echo 1 > /debug/tracing/options/userstacktrace
# echo 1 > /debug/tracing/events/lock/enable
# while :; do ls > /dev/null ; done
Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.
1) | copy_user_handle_tail() {
1) | bad_area_nosemaphore() {
1) | __bad_area_nosemaphore() {
1) | no_context() {
1) | fixup_exception() {
1) 0.319 us | search_exception_tables();
1) 0.873 us | }
[...]
1) 0.314 us | __rcu_read_unlock();
1) 0.325 us | native_apic_mem_write();
1) 0.943 us | }
1) 0.304 us | rcu_nmi_exit();
[...]
1) 0.479 us | find_vma();
1) | bad_area() {
1) | __bad_area() {
After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:
ffffffff8135b660 <page_fault>:
ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp
ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800 <error_entry>
What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.
The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.
The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.
Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the trace iterator is read, tracing_start() and tracing_stop()
is called to stop tracing while the iterator is processing the trace
output.
These functions disable both the standard buffer and the max latency
buffer. But if the wakeup tracer is running, it can switch these
buffers between the two disables:
buffer = global_trace.buffer;
if (buffer)
ring_buffer_record_disable(buffer);
<<<--------- swap happens here
buffer = max_tr.buffer;
if (buffer)
ring_buffer_record_disable(buffer);
What happens is that we disabled the same buffer twice. On tracing_start()
we can enable the same buffer twice. All ring_buffer_record_disable()
must be matched with a ring_buffer_record_enable() or the buffer
can be disable permanently, or enable prematurely, and cause a bug
where a reset happens while a trace is commiting.
This patch protects these two by taking the ftrace_max_lock to prevent
a switch from occurring.
Found with Li Zefan's ftrace_stress_test.
Cc: stable@kernel.org
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the ftrace code that resets the ring buffer it references the
buffer with a local variable, but then uses the tr->buffer as the
parameter to reset. If the wakeup tracer is running, which can
switch the tr->buffer with the max saved buffer, this can break
the requirement of disabling the buffer before the reset.
buffer = tr->buffer;
ring_buffer_record_disable(buffer);
synchronize_sched();
__tracing_reset(tr->buffer, cpu);
If the tr->buffer is swapped, then the reset is not happening to the
buffer that was disabled. This will cause the ring buffer to fail.
Found with Li Zefan's ftrace_stress_test.
Cc: stable@kernel.org
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the graph tracer is active, and a task is forked but the allocating of
the processes graph stack fails, it can cause crash later on.
This is due to the temporary stack being NULL, but the curr_ret_stack
variable is copied from the parent. If it is not -1, then in
ftrace_graph_probe_sched_switch() the following:
for (index = next->curr_ret_stack; index >= 0; index--)
next->ret_stack[index].calltime += timestamp;
Will cause a kernel OOPS.
Found with Li Zefan's ftrace_stress_test.
Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring buffer resizing and resetting relies on a schedule RCU
action. The buffers are disabled, a synchronize_sched() is called
and then the resize or reset takes place.
But this only works if the disabling of the buffers are within the
preempt disabled section, otherwise a window exists that the buffers
can be written to while a reset or resize takes place.
Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B949E43.2010906@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Export perf_trace_regs and perf_arch_fetch_caller_regs since module will
use these.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
[ use EXPORT_PER_CPU_SYMBOL_GPL() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4B989C1B.2090407@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace the calls to read_barrier_depends() in
ftrace_list_func() with rcu_dereference_raw() to improve
readability. The reason that we use rcu_dereference_raw() here
is that removed entries are never freed, instead they are simply
leaked. This is one of a very few cases where use of
rcu_dereference_raw() is the long-term right answer. And I
don't yet know of any others. ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Drop the obsolete "profile" naming used by perf for trace events.
Perf can now do more than simple events counting, so generalize
the API naming.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
We are taking a wrong regs snapshot when a trace event triggers.
Either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs()
which gives us the state before we entered the kernel, assuming
we are lucky enough to be no kernel thread, in which case
task_pt_regs() returns the initial set of regs when the kernel
thread was started.
What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.
Let's use the new perf_fetch_caller_regs() for that.
Comparison with perf record -e lock: -R -a -f -g
Before:
perf [kernel] [k] __do_softirq
|
--- __do_softirq
|
|--55.16%-- __open
|
--44.84%-- __write_nocancel
After:
perf [kernel] [k] perf_tp_event
|
--- perf_tp_event
|
|--41.07%-- lock_acquire
| |
| |--39.36%-- _raw_spin_lock
| | |
| | |--7.81%-- hrtimer_interrupt
| | | smp_apic_timer_interrupt
| | | apic_timer_interrupt
The old case was producing unreliable callchains. Now having
right frame and instruction pointers, we have the trace we
want.
Also syscalls and kprobe events already have the right regs,
let's use them instead of wasting a retrieval.
v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs()
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
Add support for tracing_thresh to the function_graph tracer. This
version of this feature isolates the checks into new entry and
return functions, to avoid adding more conditional code into the
main function_graph paths.
When the tracing_thresh is set and the function graph tracer is
enabled, only the functions that took longer than the time in
microseconds that was set in tracing_thresh are recorded. To do this
efficiently, only the function exits are recorded:
[tracing]# echo 100 > tracing_thresh
[tracing]# echo function_graph > current_tracer
[tracing]# cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) ! 119.214 us | } /* smp_apic_timer_interrupt */
1) <========== |
0) ! 101.527 us | } /* __rcu_process_callbacks */
0) ! 126.461 us | } /* rcu_process_callbacks */
0) ! 145.111 us | } /* __do_softirq */
0) ! 149.667 us | } /* do_softirq */
0) ! 168.817 us | } /* irq_exit */
0) ! 248.254 us | } /* smp_apic_timer_interrupt */
Also, add support for specifying tracing_thresh on the kernel
command line. When used like so: "tracing_thresh=200 ftrace=function_graph"
this can be used to analyse system startup. It is important to disable
tracing soon after boot, in order to avoid losing the trace data.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tim Bird <tim.bird@am.sony.com>
LKML-Reference: <4B87098B.4040308@am.sony.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latency output showed:
# | task: -3 (uid:0 nice:0 policy:1 rt_prio:99)
The comm is missing in the "task:" and it looks like a minus 3 is
the output. The correct display should be:
# | task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99)
The problem is that the comm is being stored in the wrong data
structure. The max_tr.data[cpu] is what stores the comm, not the
tr->data[cpu].
Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace
ended up showing just the '-' sign followed by the pid.
Also remove a needless initialization of max_data.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1267824230-23861-1-git-send-email-acme@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a '}' does not have a matching function start, the name is printed
within parenthesis. But this makes it confusing between ending '}'
and function starts. This patch makes the function name appear in C comment
notation.
Old view:
3) 1.281 us | } (might_fault)
3) 3.620 us | } (filldir)
3) 5.251 us | } (call_filldir)
3) | call_filldir() {
3) | filldir() {
New view:
3) 1.281 us | } /* might_fault */
3) 3.620 us | } /* filldir */
3) 5.251 us | } /* call_filldir */
3) | call_filldir() {
3) | filldir() {
Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The declaration of ftrace_set_func() is at the start of the ftrace.c file
and wrapped with a #ifdef CONFIG_FUNCTION_GRAPH condition. If function
graph tracing is enabled but CONFIG_DYNAMIC_FTRACE is not, a warning
about that function being declared static and unused is given.
This really should have been placed within the CONFIG_FUNCTION_GRAPH
condition that uses ftrace_set_func().
Moving the declaration down fixes the warning and makes the code cleaner.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This warning in s_next() can be triggered by lseek():
[<c018b3f7>] ? s_next+0x77/0x80
[<c013e3c1>] warn_slowpath_common+0x81/0xa0
[<c018b3f7>] ? s_next+0x77/0x80
[<c013e3fa>] warn_slowpath_null+0x1a/0x20
[<c018b3f7>] s_next+0x77/0x80
[<c01efa77>] traverse+0x117/0x200
[<c01eff13>] seq_lseek+0xa3/0x120
[<c01efe70>] ? seq_lseek+0x0/0x120
[<c01d7081>] vfs_llseek+0x41/0x50
[<c01d8116>] sys_llseek+0x66/0xa0
[<c0102bd0>] sysenter_do_call+0x12/0x26
The iterator "leftover" variable is zeroed in the opening of the trace
file. But lseek can call s_start() which will call s_next() without
reseting the "leftover" variable back to zero, which might trigger
the WARN_ON_ONCE(iter->leftover) that is in s_next().
Cc: stable@kernel.org
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits)
block: don't access jiffies when initialising io_context
cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds
block: fix for "Consolidate phys_segment and hw_segment limits"
cfq-iosched: quantum check tweak
blktrace: perform cleanup after setup error
blkdev: fix merge_bvec_fn return value checks
cfq-iosched: requests "in flight" vs "in driver" clarification
cciss: Fix problem with scatter gather elements in the scsi half of the driver
cciss: eliminate unnecessary pointer use in cciss scsi code
cciss: do not use void pointer for scsi hba data
cciss: factor out scatter gather chain block mapping code
cciss: fix scatter gather chain block dma direction kludge
cciss: simplify scatter gather code
cciss: factor out scatter gather chain block allocation and freeing
cciss: detect bad alignment of scsi commands at build time
cciss: clarify command list padding calculation
cfq-iosched: rethink seeky detection for SSDs
cfq-iosched: rework seeky detection
block: remove padding from io_context on 64bit builds
block: Consolidate phys_segment and hw_segment limits
...
Currently even if BLKTRACESETUP ioctl has failed user must call
BLKTRACETEARDOWN to be shure what all staff was cleaned, which
is contr-intuitive.
Let's setup ioctl make necessery cleanup by it self.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
trace_clock.c includes spinlock.h, which ends up including
asm/system.h, which in turn includes linux/irqflags.h in x86.
So the definition of raw_local_irq_save is luckily covered there,
but this is not the case in parisc:
tip/kernel/trace/trace_clock.c:86: error: implicit declaration of function 'raw_local_irq_save'
tip/kernel/trace/trace_clock.c:112: error: implicit declaration of function 'raw_local_irq_restore'
We need to include linux/irqflags.h directly from trace_clock.c
to avoid such build error.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function graph tracer is currently the most invasive tracer
in the ftrace family. It can easily overflow the buffer even with
10megs per CPU. This means that events can often be lost.
On start up, or after events are lost, if the function return is
recorded but the function enter was lost, all we get to see is the
exiting '}'.
Here is how a typical trace output starts:
[tracing] cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) + 91.897 us | }
0) ! 567.961 us | }
0) <========== |
0) ! 579.083 us | _raw_spin_lock_irqsave();
0) 4.694 us | _raw_spin_unlock_irqrestore();
0) ! 594.862 us | }
0) ! 603.361 us | }
0) ! 613.574 us | }
0) ! 623.554 us | }
0) 3.653 us | fget_light();
0) | sock_poll() {
There are a series of '}' with no matching "func() {". There's no information
to what functions these ending brackets belong to.
This patch adds a stack on the per cpu structure used in outputting
the function graph tracer to keep track of what function was outputted.
Then on a function exit event, it checks the depth to see if the
function exit has a matching entry event. If it does, then it only
prints the '}', otherwise it adds the function name after the '}'.
This allows function exit events to show what function they belong to
at trace output startup, when the entry was lost due to ring buffer
overflow, or even after a new task is scheduled in.
Here is what the above trace will look like after this patch:
[tracing] cat trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) + 91.897 us | } (irq_exit)
0) ! 567.961 us | } (smp_apic_timer_interrupt)
0) <========== |
0) ! 579.083 us | _raw_spin_lock_irqsave();
0) 4.694 us | _raw_spin_unlock_irqrestore();
0) ! 594.862 us | } (add_wait_queue)
0) ! 603.361 us | } (__pollwait)
0) ! 613.574 us | } (tcp_poll)
0) ! 623.554 us | } (sock_poll)
0) 3.653 us | fget_light();
0) | sock_poll() {
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Discard freeing field->type since it is not necessary.
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-5-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "cpu" variable is declared at the start of the function and
also within a branch, with the exact same initialization.
Remove the local variable of the same name in the branch.
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-3-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The power tracer has been converted to power trace events.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B84D50E.4070806@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
GCC 4.5 introduces behavior that forces the alignment of structures to
use the largest possible value. The default value is 32 bytes, so if
some structures are defined with a 4-byte alignment and others aren't
declared with an alignment constraint at all - it will align at 32-bytes.
For things like the ftrace events, this results in a non-standard array.
When initializing the ftrace subsystem, we traverse the _ftrace_events
section and call the initialization callback for each event. When the
structures are misaligned, we could be treating another part of the
structure (or the zeroed out space between them) as a function pointer.
This patch forces the alignment for all the ftrace_event_call structures
to 4 bytes.
Without this patch, the kernel fails to boot very early when built with
gcc 4.5.
It's trivial to check the alignment of the members of the array, so it
might be worthwhile to add something to the build system to do that
automatically. Unfortunately, that only covers this case. I've asked one
of the gcc developers about adding a warning when this condition is seen.
Cc: stable@kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
LKML-Reference: <4B85770B.6010901@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
KPROBES_EVENT actually depends on the regs and stack access API
(b1cf540f) and not on x86.
So introduce a new config option which architectures can select if
they have the API implemented and switch x86.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20100210162517.GB6933@osiris.boeblingen.de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Most implementations of arch_syscall_addr() are the same, so create a
default version in common code and move the one piece that differs (the
syscall table) to asm/syscall.h. New arch ports don't have to waste
time copying & pasting this simple function.
The s390/sparc versions need to be different, so document why.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1264498803-17278-1-git-send-email-vapier@gentoo.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Trying to add a probe like:
echo p:myprobe 0x10000 > /sys/kernel/debug/tracing/kprobe_events
will fail since the wrong pointer is passed to strict_strtoul
when trying to convert the address to an unsigned long.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100210162346.GA6933@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I don't see why we can only clear all functions from the filter.
After patching:
# echo sys_open > set_graph_function
# echo sys_close >> set_graph_function
# cat set_graph_function
sys_open
sys_close
# echo '!sys_close' >> set_graph_function
# cat set_graph_function
sys_open
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B726388.2000408@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some misspelled occurences of 'octet' and some comments were also fixed
as I was on it.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Some comments misspell "truly"; this fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Remove record freezing. Because kprobes never puts probe on
ftrace's mcount call anymore, it doesn't need ftrace to check
whether kprobes on it.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100202214925.4694.73469.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>