Impact: API *CHANGE*. Must update all tracepoint users.
Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
structure in a single spot for all the kernel. It helps reducing memory
consumption, especially when declaring a lot of tracepoints, e.g. for
kmalloc tracing.
*API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for
tracepoint declarations rather than DEFINE_TRACE(). This is the sane way
to do it. The name previously used was misleading.
Updates scheduler instrumentation to follow this API change.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use module notifiers for tracepoint updates rather than adding a hook in
module.c.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
That's overkill, takes space. We have a global tracepoint registery in
header files anyway.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make sure tracepoints can be called within ftrace callbacks.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix race
Set the probe array pointer to NULL when the tracepoint is disabled.
The probe array point not being NULL could generate a race condition
where the reader would dereference a freed pointer.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix a bug in sample tracepoints
Need a tracepoint_synchronize_unregister() before the end of exit() to
make sure every probe callers have exited the non preemptible section
and thus are not executing the probe code anymore.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API.
Allow markers to be used only for declaration, without function call
associated. Useful to create specialized probes.
The problem we had is that two function calls were required when one
wanted to put a marker in a tracepoint probe. Now the marker can be used
simply for trace data type declaration, leaving the trace write work
within the tracepoint probe without any additional function call.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API
Add a new API trace_mark_tp(), which declares a marker within a
tracepoint probe. When the marker is activated, the tracepoint is
automatically enabled.
No branch test is used at the marker site, because it would be a
duplicate of the branch already present in the tracepoint.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use module notifiers instead of adding a hook in module.c.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make marker critical code use notrace to make sure they can be used as an
ftrace callback.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix (for future changes)
That seemed to cause built issue when marker.h is included early, even
though stdargs.h is included in kernel.h.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix marker registers/unregister race
get_marker() can return a NULL entry because the mutex is released in
the middle of those functions. Make sure we check to see if it has been
concurrently removed.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API, useful for tracepoints and markers.
Add _notrace version to rcu_read_*_sched().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Reviewed-by: Paul E McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since this scripts output is usually redirected, put error messages
on standard error and exit with error code if no data is found.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because it has goto out before ftrace_list == &ftrace_list_end,
that's to say, we never meet this condition.
Signed-off-by: walimis <walimisdev@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pekka reported a crash when resizing the mmiotrace tracer (if only
mmiotrace is enabled).
This happens because in that case we do not allocate the max buffer,
but we try to use it.
Make ring_buffer_resize() idempotent against NULL buffers.
Reported-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: better handling of CPU buffer start annotation
Because of the confusion with the per CPU buffers wrapping where
one CPU might be more active at the end of the trace than the other
CPUs causing that one CPU to have a shorter history. Kernel
developers were confused by the "missing" data of that one CPU
at the beginning of the trace output. An annotation was added to
the trace output to show that the buffer had started:
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
##### CPU 3 buffer started ####
<idle>-0 [003] 158.192959: smp_apic_timer_interrupt
[...]
<idle>-0 [003] 161.556520: default_idle
##### CPU 1 buffer started ####
<idle>-0 [001] 161.592494: hrtimer_force_reprogram
[etc]
But this annotation gets a bit messy when tracers do not fill the
buffers. This patch does a couple of things:
One) it adds a flag to trace_options to disable these annotations
Two) it does not annotate if the tracer did not overflow its buffer.
This makes the output much cleaner.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: rename file /debug/tracing/iter_ctrl to /debug/tracing/trace_options
The original ftrace had a file called "iter_ctrl" that would control
the way the output was iterated. But this file grew into a catch all
for different trace options. This patch renames the file from iter_ctrl
to trace_options to reflect this change.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the units of buffer_size_kb to kilobytes
This patch changes the units of the buffer_size_kb file to kilobytes.
Reading and writing to the file uses kilobytes as units. To help
users to know what units are used, the output of the file now
looks like:
# cat /debug/tracing/buffer_size_kb
1408
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: rename of debugfs file trace_entries to buffer_size_kb
The original ftrace had fixed size entries, and the number of entries
was shown and modified via the file called trace_entries. By converting
to the unified trace buffer, we now allow for variable size entries
which makes the meaning of trace_entries pointless.
Since trace_size might be confused to the size of the trace, this patch
names it "buffer_size_kb" (thanks to Arjan van de Ven for this idea).
[ mingo@elte.hu: changed from buffer_size to buffer_size_kb ]
( Note, the units are still bytes - the next patch changes that,
to keep the wide rename patch separate from the unit-change patch. )
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Optimize a bit the function return tracer
This patch changes the calling convention of prepare_ftrace_return to
pass its arguments by register. This will optimize it a bit and
prepare it to support dynamic tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove spinlocks and irq disabling in function return tracer.
I've tried to figure out all of the race condition that could happen
when the tracer pushes or pops a return address trace to/from the
current thread_info.
Theory:
_ One thread can only execute on one cpu at a time. So this code
doesn't need to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an
interrupt. If an interrupt is raised, it will increase the index of
the return stack storage and then execute until the end of the
tracing to finally free the index it used. We don't need to disable
irqs.
This is theorical. In practice, I've tested it with a two-core SMP and
had no problem at all. Perhaps -tip testing could confirm it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: File name change of trace_unlikely.c
The "unlikely" name for the tracer is quite ugly. We renamed all the
parts of it to "branch" and now it is time to rename the file too.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: added new branch tracer
Currently the tracing of branch profiling (unlikelys and likelys hit)
is only activated by the iter_ctrl. This patch adds a tracer called
"branch" that will just trace the branch profiling. The advantage
of adding this tracer is that it can be added to the ftrace selftests
on startup.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: rename of iter_ctrl unlikely to branch
The unlikely name is ugly. This patch converts the iter_ctrl command
"unlikely" and "nounlikely" to "branch" and "nobranch" respectively.
It also renames a lot of internal functions to use "branch" instead
of "unlikely".
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: name change of unlikely tracer and profiler
Ingo Molnar suggested changing the config from UNLIKELY_PROFILE
to BRANCH_PROFILING. I never did like the "unlikely" name so I
went one step farther, and renamed all the unlikely configurations
to a "BRANCH" variant.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix the following warning from the perl syntax checking tool perlcritic.
This tool is a lint like tool that checks for perl best practices.
Loop iterator is not lexical at line 113, column 1.
See page 108 of PBP. (Severity: 5)
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steve suggested the to change the output from this:
> bash-3471 [003] 357.014755: [ MISS ] sched_info_dequeued:sched_stats.h:177
> bash-3471 [003] 357.014756: [ .... ] update_curr:sched_fair.c:489
> bash-3471 [003] 357.014758: [ .... ] calc_delta_fair:sched_fair.c:411
to this:
> bash-3471 [003] 357.014755: [ MISS ] sched_info_dequeued:sched_stats.h:177
> bash-3471 [003] 357.014756: [ ok ] update_curr:sched_fair.c:489
> bash-3471 [003] 357.014758: [ ok ] calc_delta_fair:sched_fair.c:411
as it makes it clearer to the user what it means exactly.
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bootup crash
the branch tracer missed arch/x86/vdso/vclock_gettime.c from
disabling tracing, which caused such bootup crashes:
[ 201.840097] init[1]: segfault at 7fffed3fe7c0 ip 00007fffed3fea2e sp 000077
also clean up the ugly ifdefs in arch/x86/kernel/vsyscall_64.c by
creating DISABLE_UNLIKELY_PROFILE facility for code to turn off
instrumentation on a per file basis.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: modify the tracer output, to make it a bit easier to read
Change the output from:
> bash-3471 [003] 357.014755: [INCORRECT] sched_info_dequeued:sched_stats.h:177
> bash-3471 [003] 357.014756: [correct] update_curr:sched_fair.c:489
> bash-3471 [003] 357.014758: [correct] calc_delta_fair:sched_fair.c:411
to:
> bash-3471 [003] 357.014755: [ MISS ] sched_info_dequeued:sched_stats.h:177
> bash-3471 [003] 357.014756: [ .... ] update_curr:sched_fair.c:489
> bash-3471 [003] 357.014758: [ .... ] calc_delta_fair:sched_fair.c:411
it's good to have fields aligned vertically, and the only important
information is a prediction miss, so display only that information.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new likely/unlikely branch tracer
This patch adds a way to record the instances of the likely() and unlikely()
branch condition annotations.
When "unlikely" is set in /debugfs/tracing/iter_ctrl the unlikely conditions
will be added to any of the ftrace tracers. The change takes effect when
a new tracer is passed into the current_tracer file.
For example:
bash-3471 [003] 357.014755: [INCORRECT] sched_info_dequeued:sched_stats.h:177
bash-3471 [003] 357.014756: [correct] update_curr:sched_fair.c:489
bash-3471 [003] 357.014758: [correct] calc_delta_fair:sched_fair.c:411
bash-3471 [003] 357.014759: [correct] account_group_exec_runtime:sched_stats.h:356
bash-3471 [003] 357.014761: [correct] update_curr:sched_fair.c:489
bash-3471 [003] 357.014763: [INCORRECT] calc_delta_fair:sched_fair.c:411
bash-3471 [003] 357.014765: [correct] calc_delta_mine:sched.c:1279
Which shows the normal tracer heading, as well as whether the condition was
correct "[correct]" or was mistaken "[INCORRECT]", followed by the function,
file name and line number.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new unlikely/likely profiler
Andrew Morton recently suggested having an in-kernel way to profile
likely and unlikely macros. This patch achieves that goal.
When configured, every(*) likely and unlikely macro gets a counter attached
to it. When the condition is hit, the hit and misses of that condition
are recorded. These numbers can later be retrieved by:
/debugfs/tracing/profile_likely - All likely markers
/debugfs/tracing/profile_unlikely - All unlikely markers.
# cat /debug/tracing/profile_unlikely | head
correct incorrect % Function File Line
------- --------- - -------- ---- ----
2167 0 0 do_arch_prctl process_64.c 832
0 0 0 do_arch_prctl process_64.c 804
2670 0 0 IS_ERR err.h 34
71230 5693 7 __switch_to process_64.c 673
76919 0 0 __switch_to process_64.c 639
43184 33743 43 __switch_to process_64.c 624
12740 64181 83 __switch_to process_64.c 594
12740 64174 83 __switch_to process_64.c 590
# cat /debug/tracing/profile_unlikely | \
awk '{ if ($3 > 25) print $0; }' |head -20
44963 35259 43 __switch_to process_64.c 624
12762 67454 84 __switch_to process_64.c 594
12762 67447 84 __switch_to process_64.c 590
1478 595 28 syscall_get_error syscall.h 51
0 2821 100 syscall_trace_leave ptrace.c 1567
0 1 100 native_smp_prepare_cpus smpboot.c 1237
86338 265881 75 calc_delta_fair sched_fair.c 408
210410 108540 34 calc_delta_mine sched.c 1267
0 54550 100 sched_info_queued sched_stats.h 222
51899 66435 56 pick_next_task_fair sched_fair.c 1422
6 10 62 yield_task_fair sched_fair.c 982
7325 2692 26 rt_policy sched.c 144
0 1270 100 pre_schedule_rt sched_rt.c 1261
1268 48073 97 pick_next_task_rt sched_rt.c 884
0 45181 100 sched_info_dequeued sched_stats.h 177
0 15 100 sched_move_task sched.c 8700
0 15 100 sched_move_task sched.c 8690
53167 33217 38 schedule sched.c 4457
0 80208 100 sched_info_switch sched_stats.h 270
30585 49631 61 context_switch sched.c 2619
# cat /debug/tracing/profile_likely | awk '{ if ($3 > 25) print $0; }'
39900 36577 47 pick_next_task sched.c 4397
20824 15233 42 switch_mm mmu_context_64.h 18
0 7 100 __cancel_work_timer workqueue.c 560
617 66484 99 clocksource_adjust timekeeping.c 456
0 346340 100 audit_syscall_exit auditsc.c 1570
38 347350 99 audit_get_context auditsc.c 732
0 345244 100 audit_syscall_entry auditsc.c 1541
38 1017 96 audit_free auditsc.c 1446
0 1090 100 audit_alloc auditsc.c 862
2618 1090 29 audit_alloc auditsc.c 858
0 6 100 move_masked_irq migration.c 9
1 198 99 probe_sched_wakeup trace_sched_switch.c 58
2 2 50 probe_wakeup trace_sched_wakeup.c 227
0 2 100 probe_wakeup_sched_switch trace_sched_wakeup.c 144
4514 2090 31 __grab_cache_page filemap.c 2149
12882 228786 94 mapping_unevictable pagemap.h 50
4 11 73 __flush_cpu_slab slub.c 1466
627757 330451 34 slab_free slub.c 1731
2959 61245 95 dentry_lru_del_init dcache.c 153
946 1217 56 load_elf_binary binfmt_elf.c 904
102 82 44 disk_put_part genhd.h 206
1 1 50 dst_gc_task dst.c 82
0 19 100 tcp_mss_split_point tcp_output.c 1126
As you can see by the above, there's a bit of work to do in rethinking
the use of some unlikelys and likelys. Note: the unlikely case had 71 hits
that were more than 25%.
Note: After submitting my first version of this patch, Andrew Morton
showed me a version written by Daniel Walker, where I picked up
the following ideas from:
1) Using __builtin_constant_p to avoid profiling fixed values.
2) Using __FILE__ instead of instruction pointers.
3) Using the preprocessor to stop all profiling of likely
annotations from vsyscall_64.c.
Thanks to Andrew Morton, Arjan van de Ven, Theodore Tso and Ingo Molnar
for their feed back on this patch.
(*) Not ever unlikely is recorded, those that are used by vsyscalls
(a few of them) had to have profiling disabled.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: deadlock fix in ring_buffer_read_start
The ring_buffer_iter_reset was called from ring_buffer_read_start
where both grabbed the reader_lock.
This patch separates out the internals of ring_buffer_iter_reset
to its own function so that both APIs may grab the reader_lock.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: disable preemption when calling sched_clock()
The ring_buffer_time_stamp still uses sched_clock as its counter.
But it is a bug to call it with preemption enabled. This requirement
should not be pushed to the ring_buffer_time_stamp callers, so
the ring_buffer_time_stamp needs to disable preemption when calling
sched_clock.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Cleanups on the boot tracer and ftrace
This patch bring some cleanups about the boot tracer headers. The
functions and structures of this tracer have nothing related to ftrace
and should have so their own header file.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Restructure WARN_ONs in ring_buffer.c
The current WARN_ON macros in ring_buffer.c are quite ugly.
This patch cleans them up and uses a single RB_WARN_ON that returns
the value of the condition. This allows the caller to abort the
function if the condition is true.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enable/disable ring buffer recording API added
Several kernel developers have requested that there be a way to stop
recording into the ring buffers with a simple switch that can also
be enabled from userspace. This patch addes a new kernel API to the
ring buffers called:
tracing_on()
tracing_off()
When tracing_off() is called, all ring buffers will not be able to record
into their buffers.
tracing_on() will enable the ring buffers again.
These two act like an on/off switch. That is, there is no counting of the
number of times tracing_off or tracing_on has been called.
A new file is added to the debugfs/tracing directory called
tracing_on
This allows for userspace applications to also flip the switch.
echo 0 > debugfs/tracing/tracing_on
disables the tracing.
echo 1 > /debugfs/tracing/tracing_on
enables it.
Note, this does not disable or enable any tracers. It only sets or clears
a flag that needs to be set in order for the ring buffers to write to
their buffers. It is a global flag, and affects all ring buffers.
The buffers start out with tracing_on enabled.
There are now three flags that control recording into the buffers:
tracing_on: which affects all ring buffer tracers.
buffer->record_disabled: which affects an allocated buffer, which may be set
if an anomaly is detected, and tracing is disabled.
cpu_buffer->record_disabled: which is set by tracing_stop() or if an
anomaly is detected. tracing_start can not reenable this if
an anomaly occurred.
The userspace debugfs/tracing/tracing_enabled is implemented with
tracing_stop() but the user space code can not enable it if the kernel
called tracing_stop().
Userspace can enable the tracing_on even if the kernel disabled it.
It is just a switch used to stop tracing if a condition was hit.
tracing_on is not for protecting critical areas in the kernel nor is
it for stopping tracing if an anomaly occurred. This is because userspace
can reenable it at any time.
Side effect: With this patch, I discovered a dead variable in ftrace.c
called tracing_on. This patch removes it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context
nohz: disable tick_nohz_kick_tick() for now
irq: call __irq_enter() before calling the tick_idle_check
x86: HPET: enter hpet_interrupt_handler with interrupts disabled
x86: HPET: read from HPET_Tn_CMP() not HPET_T0_CMP
x86: HPET: convert WARN_ON to WARN_ON_ONCE
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: release buddies on yield
fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock
sched: clean up debug info
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ring-buffer: prevent infinite looping on time stamping
ftrace: disable tracing on resize
ftrace: fix breakage in bin_fmt results
ftrace: ftrace.txt version update
ftrace: update txt document
Impact: serialize reader accesses to individual CPU ring buffers
The code in the ring buffer expects only one reader at a time, but currently
it puts that requirement on the caller. This is not strong enough, and this
patch adds a "reader_lock" that serializes the access to the reader API
of the ring buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
[XFS] XFS: Check for valid transaction headers in recovery
[XFS] handle memory allocation failures during log initialisation
[XFS] Account for allocated blocks when expanding directories
[XFS] Wait for all I/O on truncate to zero file size
[XFS] Fix use-after-free with log and quotas