OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Steven Whitehouse	ff7b1b4f00	perfcounters: export perf_tpcounter_event Needed for modular tracepoint support. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-16 01:10:04 +02:00
Peter Zijlstra	d3d21c412d	perf_counter: log full path names Impact: fix perf-report output for /home mounted binaries, etc. dentry_path() only provide path-names up to the mount root, which is unsuited for out purpose, use d_path() instead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090409085524.601794134@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 11:50:54 +02:00
Peter Zijlstra	1ccd154978	perf_counter: sysctl for system wide perf counters Impact: add sysctl for paranoid/relaxed perfcounters policy Allow the use of system wide perf counters to everybody, but provide a sysctl to disable it for the paranoid security minded. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090409085524.514046352@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 11:50:52 +02:00
Peter Zijlstra	9ee318a782	perf_counter: optimize mmap/comm tracking Impact: performance optimization The mmap/comm tracking code does quite a lot of work before it discovers there's no interest in it, avoid that by keeping a counter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090409085524.427173196@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 11:50:43 +02:00
Ingo Molnar	888fcee066	perf_counter: fix off task->comm by one strlen() does not include the \0. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 09:48:22 +02:00
Peter Zijlstra	78f13e9525	perf_counter: allow for data addresses to be recorded Paul suggested we allow for data addresses to be recorded along with the traditional IPs as power can provide these. For now, only the software pagefault events provide data addresses, but in the future power might as well for some events. x86 doesn't seem capable of providing this atm. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130409.394816925@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 19:05:56 +02:00
Peter Zijlstra	4d855457d8	perf_counter: move PERF_RECORD_TIME Move PERF_RECORD_TIME so that all the fixed length items come before the variable length ones. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130409.307926436@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 19:05:55 +02:00
Peter Zijlstra	8d1b2d9361	perf_counter: track task-comm data Similar to the mmap data stream, add one that tracks the task COMM field, so that the userspace reporting knows what to call a task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130409.127422406@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 19:05:47 +02:00
Peter Zijlstra	6b6e5486b3	perf_counter: use misc field to widen type Push the PERF_EVENT_COUNTER_OVERFLOW bit into the misc field so that we can have the full 32bit for PERF_RECORD_ bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130408.891867663@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 18:53:28 +02:00
Peter Zijlstra	6fab01927e	perf_counter: provide misc bits in the event header Limit the size of each record to 64k (or should we count in multiples of u64 and have a 512K limit?), this gives 16 bits or spare room in the header, which we can use for misc bits, so as to not have to grow the record with u64 every time we have a few bits to report. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130408.769271806@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 18:53:27 +02:00
Peter Zijlstra	e30e08f65c	perf_counter: fix NMI race in task clock We should not be updating ctx->time from NMI context, work around that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130408.681326666@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 18:53:27 +02:00
Peter Zijlstra	bce379bf35	perf_counter: minimize context time updates Push the update_context_time() calls up the stack so that we get less invokations and thereby a less noisy output: before: # ./perfstat -e 1:0 -e 1:1 -e 1:1 -e 1:1 -l ls > /dev/null Performance counter stats for 'ls': 10.163691 cpu clock ticks (msecs) (scaled from 98.94%) 10.215360 task clock ticks (msecs) (scaled from 98.18%) 10.185549 task clock ticks (msecs) (scaled from 98.53%) 10.183581 task clock ticks (msecs) (scaled from 98.71%) Wall-clock time elapsed: 11.912858 msecs after: # ./perfstat -e 1:0 -e 1:1 -e 1:1 -e 1:1 -l ls > /dev/null Performance counter stats for 'ls': 9.316630 cpu clock ticks (msecs) 9.280789 task clock ticks (msecs) 9.280789 task clock ticks (msecs) 9.280789 task clock ticks (msecs) Wall-clock time elapsed: 9.574872 msecs Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.618876874@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:01 +02:00
Peter Zijlstra	849691a6cd	perf_counter: remove rq->lock usage Now that all the task runtime clock users are gone, remove the ugly rq->lock usage from perf counters, which solves the nasty deadlock seen when a software task clock counter was read from an NMI overflow context. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.531137582@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:01 +02:00
Peter Zijlstra	a39d6f2556	perf_counter: rework the task clock software counter Rework the task clock software counter to use the context time instead of the task runtime clock, this removes the last such user. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.445450972@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:00 +02:00
Peter Zijlstra	4af4998b8a	perf_counter: rework context time Since perf_counter_context is switched along with tasks, we can maintain the context time without using the task runtime clock. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.353552838@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:00 +02:00
Peter Zijlstra	4c9e25428f	perf_counter: change event definition Currently the definition of an event is slightly ambiguous. We have wakeup events, for poll() and SIGIO, which are either generated when a record crosses a page boundary (hw_events.wakeup_events == 0), or every wakeup_events new records. Now a record can be either a counter overflow record, or a number of different things, like the mmap PROT_EXEC region notifications. Then there is the PERF_COUNTER_IOC_REFRESH event limit, which only considers counter overflows. This patch changes then wakeup_events and SIGIO notification to only consider overflow events. Furthermore it changes the SIGIO notification to report SIGHUP when the event limit is reached and the counter will be disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.266679874@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:59 +02:00
Peter Zijlstra	79f1464156	perf_counter: counter overflow limit Provide means to auto-disable the counter after 'n' overflow events. Create the counter with hw_event.disabled = 1, and then issue an ioctl(fd, PREF_COUNTER_IOC_REFRESH, n); to set the limit and enable the counter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.083139737@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:58 +02:00
Peter Zijlstra	339f7c90b8	perf_counter: PERF_RECORD_TIME By popular request, provide means to log a timestamp along with the counter overflow event. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.024173282@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:57 +02:00
Peter Zijlstra	ebb3c4c4cb	perf_counter: fix the mlock accounting Reading through the code I saw I forgot the finish the mlock accounting. Do so now. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.899767331@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:57 +02:00
Peter Zijlstra	f6c7d5fe58	perf_counter: theres more to overflow than writing events Prepare for more generic overflow handling. The new perf_counter_overflow() method will handle the generic bits of the counter overflow, and can return a !0 return value, in which case the counter should be (soft) disabled, so that it won't count until it's properly disabled. XXX: do powerpc and swcounter Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.812109629@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:56 +02:00
Peter Zijlstra	671dec5daf	perf_counter: generalize pending infrastructure Prepare the pending infrastructure to do more than wakeups. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.634732847@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:55 +02:00
Peter Zijlstra	3c446b3d3b	perf_counter: SIGIO support Provide support for fcntl() I/O availability signals. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.579788800@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:55 +02:00
Peter Zijlstra	9c03d88e32	perf_counter: add more context information Change the callchain context entries to u16, so as to gain some space. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.457320003@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:54 +02:00
Peter Zijlstra	92f22a3865	perf_counter: update mmap() counter read Paul noted that we don't need SMP barriers for the mmap() counter read because its always on the same cpu (otherwise you can't access the hw counter anyway). So remove the SMP barriers and replace them with regular compiler barriers. Further, update the comment to include a race free method of reading said hardware counter. The primary change is putting the pmc_read inside the seq-loop, otherwise we can still race and read rubbish. Noticed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.577951445@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:47 +02:00
Peter Zijlstra	5872bdb88a	perf_counter: add more context information Put in counts to tell which ips belong to what context. ----- \| \| hv \| -- nr \| \| kernel \| -- \| \| user ----- Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.493101305@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:46 +02:00
Peter Zijlstra	c457810ab4	perf_counter: per event wakeups By request, provide a way to request a wakeup every 'n' events instead of every page of output. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.323309784@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:45 +02:00
Peter Zijlstra	8a057d8491	perf_counter: move the event overflow output bits to record_type Per suggestion from Paul, move the event overflow bits to record_type and sanitize the enums a bit. Breaks the ABI -- again ;-) Suggested-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.151921176@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:45 +02:00
Peter Zijlstra	394ee07623	perf_counter: provide generic callchain bits Provide the generic callchain support bits. If hw_event->callchain is set the arch specific perf_callchain() function is called upon to provide a perf_callchain_entry structure filled with the current callchain. If it does so, it is added to the overflow output event. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171024.254266860@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:43 +02:00
Peter Zijlstra	5ed00415e3	perf_counter: re-arrange the perf_event_type Breaks ABI yet again :-) Change the event type so that [0, 2^31-1] are regular event types, but [2^31, 2^32-1] forms a bitmask for overflow events. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171024.047961770@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:42 +02:00
Peter Zijlstra	78d613eb12	perf_counter: small cleanup of the output routines Move the nmi argument to the _begin() function, so that _end() only needs the handle. This allows the _begin() function to generate a wakeup on event loss. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171023.959404268@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:41 +02:00
Paul Mackerras	d5d2bc0dd0	perf_counter: make it possible for hw_perf_counter_init to return error codes Impact: better error reporting At present, if hw_perf_counter_init encounters an error, all it can do is return NULL, which causes sys_perf_counter_open to return an EINVAL error to userspace. This isn't very informative for userspace; it means that userspace can't tell the difference between "sorry, oprofile is already using the PMU" and "we don't support this CPU" and "this CPU doesn't support the requested generic hardware event". This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let hw_perf_counter_init return an error code on error rather than just NULL if it wishes. If it does so, that error code will be returned from sys_perf_counter_open to userspace. If it returns NULL, an EINVAL error will be returned to userspace, as before. This also adapts the powerpc hw_perf_counter_init to make use of this to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate. It would be good to add extra error numbers in future to allow userspace to distinguish the various errors that are currently reported as EINVAL, i.e. irq_period < 0, too many events in a group, conflict between exclude_* settings in a group, and PMU resource conflict in a group. [ v2: fix a bug pointed out by Corey Ashford where error returns from hw_perf_counter_init were not handled correctly in the case of raw hardware events.] Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Orig-LKML-Reference: <20090330171023.682428180@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:40 +02:00
Peter Zijlstra	0a4a93919b	perf_counter: executable mmap() information Currently the profiling information returns userspace IPs but no way to correlate them to userspace code. Userspace could look into /proc/$pid/maps but that might not be current or even present anymore at the time of analyzing the IPs. Therefore provide means to track the mmap information and provide it in the output stream. XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Cc: Andrew Morton <akpm@linux-foundation.org> Orig-LKML-Reference: <20090330171023.417259499@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:38 +02:00
Peter Zijlstra	38ff667b32	perf_counter: fix update_userpage() It just occured to me it is possible to have multiple contending updates of the userpage (mmap information vs overflow vs counter). This would break the seqlock logic. It appear the arch code uses this from NMI context, so we cannot possibly serialize its use, therefore separate the data_head update from it and let it return to its original use. The arch code needs to make sure there are no contending callers by disabling the counter before using it -- powerpc appears to do this nicely. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171023.241410660@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:37 +02:00
Peter Zijlstra	925d519ab8	perf_counter: unify and fix delayed counter wakeup While going over the wakeup code I noticed delayed wakeups only work for hardware counters but basically all software counters rely on them. This patch unifies and generalizes the delayed wakeup to fix this issue. Since we're dealing with NMI context bits here, use a cmpxchg() based single link list implementation to track counters that have pending wakeups. [ This should really be generic code for delayed wakeups, but since we cannot use cmpxchg()/xchg() in generic code, I've let it live in the perf_counter code. -- Eric Dumazet could use it to aggregate the network wakeups. ] Furthermore, the x86 method of using TIF flags was flawed in that its quite possible to end up setting the bit on the idle task, loosing the wakeup. The powerpc method uses per-cpu storage and does appear to be sufficient. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171023.153932974@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:36 +02:00
Paul Mackerras	53cfbf5937	perf_counter: record time running and time enabled for each counter Impact: new functionality Currently, if there are more counters enabled than can fit on the CPU, the kernel will multiplex the counters on to the hardware using round-robin scheduling. That isn't too bad for sampling counters, but for counting counters it means that the value read from a counter represents some unknown fraction of the true count of events that occurred while the counter was enabled. This remedies the situation by keeping track of how long each counter is enabled for, and how long it is actually on the cpu and counting events. These times are recorded in nanoseconds using the task clock for per-task counters and the cpu clock for per-cpu counters. These values can be supplied to userspace on a read from the counter. Userspace requests that they be supplied after the counter value by setting the PERF_FORMAT_TOTAL_TIME_ENABLED and/or PERF_FORMAT_TOTAL_TIME_RUNNING bits in the hw_event.read_format field when creating the counter. (There is no way to change the read format after the counter is created, though it would be possible to add some way to do that.) Using this information it is possible for userspace to scale the count it reads from the counter to get an estimate of the true count: true_count_estimate = count * total_time_enabled / total_time_running This also lets userspace detect the situation where the counter never got to go on the cpu: total_time_running == 0. This functionality has been requested by the PAPI developers, and will be generally needed for interpreting the count values from counting counters correctly. In the implementation, this keeps 5 time values (in nanoseconds) for each counter: total_time_enabled and total_time_running are used when the counter is in state OFF or ERROR and for reporting back to userspace. When the counter is in state INACTIVE or ACTIVE, it is the tstamp_enabled, tstamp_running and tstamp_stopped values that are relevant, and total_time_enabled and total_time_running are determined from them. (tstamp_stopped is only used in INACTIVE state.) The reason for doing it like this is that it means that only counters being enabled or disabled at sched-in and sched-out time need to be updated. There are no new loops that iterate over all counters to update total_time_enabled or total_time_running. This also keeps separate child_total_time_running and child_total_time_enabled fields that get added in when reporting the totals to userspace. They are separate fields so that they can be atomic. We don't want to use atomics for total_time_running, total_time_enabled etc., because then we would have to use atomic sequences to update them, which are slower than regular arithmetic and memory accesses. It is possible to measure total_time_running by adding a task_clock counter to each group of counters, and total_time_enabled can be measured approximately with a top-level task_clock counter (though inaccuracies will creep in if you need to disable and enable groups since it is not possible in general to disable/enable the top-level task_clock counter simultaneously with another group). However, that adds extra overhead - I measured around 15% increase in the context switch latency reported by lat_ctx (from lmbench) when a task_clock counter was added to each of 2 groups, and around 25% increase when a task_clock counter was added to each of 4 groups. (In both cases a top-level task-clock counter was also added.) In contrast, the code added in this commit gives better information with no overhead that I could measure (in fact in some cases I measured lower times with this code, but the differences were all less than one standard deviation). [ v2: address review comments by Andrew Morton. ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Orig-LKML-Reference: <18890.6578.728637.139402@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:36 +02:00
Peter Zijlstra	7730d86558	perf_counter: allow and require one-page mmap on counting counters A brainfart stopped single page mmap()s working. The rest of the code should be perfectly fine with not having any data pages. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Orig-LKML-Reference: <1237981712.7972.812.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:35 +02:00
Peter Zijlstra	ea5d20cf99	perf_counter: optionally provide the pid/tid of the sampled task Allow cpu wide counters to profile userspace by providing what process the sample belongs to. This raises the first issue with the output type, lots of these options: group, tid, callchain, etc.. are non-exclusive and could be combined, suggesting a bitfield. However, things like the mmap() data stream doesn't fit in that. How to split the type field... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Orig-LKML-Reference: <20090325113317.013775235@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:34 +02:00
Peter Zijlstra	63e35b25d6	perf_counter: sanity check on the output API Ensure we never write more than we said we would. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Orig-LKML-Reference: <20090325113316.921433024@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:33 +02:00
Peter Zijlstra	5c14819432	perf_counter: output objects Provide a {type,size} header for each output entry. This should provide extensible output, and the ability to mix multiple streams. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Orig-LKML-Reference: <20090325113316.831607932@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:33 +02:00
Peter Zijlstra	b9cacc7bf1	perf_counter: more elaborate write API Provide a begin, copy, end interface to the output buffer. begin() reserves the space, copy() copies the data over, considering page boundaries, end() finalizes the event and does the wakeup. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Orig-LKML-Reference: <20090325113316.740550870@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:32 +02:00
Peter Zijlstra	c7138f37f9	perf_counter: fix perf_poll() Impact: fix kerneltop 100% CPU usage Only return a poll event when there's actually been one, poll_wait() doesn't actually wait for the waitq you pass it, it only enqueues you on it. Only once all FDs have been iterated and none of thm returned a poll-event will it schedule(). Also make it return POLL_HUP when there's not mmap() area to read from. Further, fix a silly bug in the write code. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Arjan van de Ven <arjan@infradead.org> Orig-LKML-Reference: <1237897096.24918.181.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:32 +02:00
Peter Zijlstra	7b732a7504	perf_counter: new output ABI - part 1 Impact: Rework the perfcounter output ABI use sys_read() only for instant data and provide mmap() output for all async overflow data. The first mmap() determines the size of the output buffer. The mmap() size must be a PAGE_SIZE multiple of 1+pages, where pages must be a power of 2 or 0. Further mmap()s of the same fd must have the same size. Once all maps are gone, you can again mmap() with a new size. In case of 0 extra pages there is no data output and the first page only contains meta data. When there are data pages, a poll() event will be generated for each full page of data. Furthermore, the output is circular. This means that although 1 page is a valid configuration, its useless, since we'll start overwriting it the instant we report a full page. Future work will focus on the output format (currently maintained) where we'll likey want each entry denoted by a header which includes a type and length. Further future work will allow to splice() the fd, also containing the async overflow data -- splice() would be mutually exclusive with mmap() of the data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090323172417.470536358@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:27 +02:00
Paul Mackerras	37d8182838	perf_counter: add an mmap method to allow userspace to read hardware counters Impact: new feature giving performance improvement This adds the ability for userspace to do an mmap on a hardware counter fd and get access to a read-only page that contains the information needed to translate a hardware counter value to the full 64-bit counter value that would be returned by a read on the fd. This is useful on architectures that allow user programs to read the hardware counters, such as PowerPC. The mmap will only succeed if the counter is a hardware counter monitoring the current process. On my quad 2.5GHz PowerPC 970MP machine, userspace can read a counter and translate it to the full 64-bit value in about 30ns using the mmapped page, compared to about 830ns for the read syscall on the counter, so this does give a significant performance improvement. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Orig-LKML-Reference: <20090323172417.297057964@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:26 +02:00
Peter Zijlstra	96f6d44443	perf_counter: avoid recursion Tracepoint events like lock_acquire and software counters like pagefaults can recurse into the perf counter code again, avoid that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090323172417.152096433@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:25 +02:00
Peter Zijlstra	f4a2deb486	perf_counter: remove the event config bitfields Since the bitfields turned into a bit of a mess, remove them and rely on good old masks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090323172417.059499915@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:25 +02:00
Peter Zijlstra	0322cd6ec5	perf_counter: unify irq output code Impact: cleanup Having 3 slightly different copies of the same code around does nobody any good. First step in revamping the output format. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Orig-LKML-Reference: <20090319194233.929962222@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:17 +02:00
Peter Zijlstra	b8e83514b6	perf_counter: revamp syscall input ABI Impact: modify ABI The hardware/software classification in hw_event->type became a little strained due to the addition of tracepoint tracing. Instead split up the field and provide a type field to explicitly specify the counter type, while using the event_id field to specify which event to use. Raw counters still work as before, only the raw config now goes into raw_event. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Orig-LKML-Reference: <20090319194233.836807573@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:17 +02:00
Peter Zijlstra	e077df4f43	perf_counter: hook up the tracepoint events Impact: new perfcounters feature Enable usage of tracepoints as perf counter events. tracepoint event ids can be found in /debug/tracing/event///id and (for now) are represented as -65536+id in the type field. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Orig-LKML-Reference: <20090319194233.744044174@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:16 +02:00
Peter Zijlstra	f160095275	perf_counter: fix up counter free paths Impact: fix crash during perfcounters use I found another counter free path, create a free_counter() call to accomodate generic tear-down. Fixes an RCU bug. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Orig-LKML-Reference: <20090319194233.652078652@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:16 +02:00
Peter Zijlstra	4a0deca657	perf_counter: generic context switch event Impact: cleanup Use the generic software events for context switches. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Orig-LKML-Reference: <20090319194233.283522645@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:15 +02:00
Peter Zijlstra	01ef09d9ff	perf_counter: fix uninitialized usage of event_list Impact: fix boot crash When doing the generic context switch event I ran into some early boot hangs, which were caused by inf func recursion (event, fault, event, fault). I eventually tracked it down to event_list not being initialized at the time of the first event. Fix this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Orig-LKML-Reference: <20090319194233.195392657@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:15 +02:00
Paul Mackerras	b6c5a71da1	perf_counter: abstract wakeup flag setting in core to fix powerpc build Impact: build fix for powerpc Commit bd753921015e7905 ("perf_counter: software counter event infrastructure") introduced a use of TIF_PERF_COUNTERS into the core perfcounter code. This breaks the build on powerpc because we use a flag in a per-cpu area to signal wakeups on powerpc rather than a thread_info flag, because the thread_info flags have to be manipulated with atomic operations and are thus slower than per-cpu flags. This fixes the by changing the core to use an abstracted set_perf_counter_pending() function, which is defined on x86 to set the TIF_PERF_COUNTERS flag and on powerpc to set the per-cpu flag (paca->perf_counter_pending). It changes the previous powerpc definition of set_perf_counter_pending to not take an argument and adds a clear_perf_counter_pending, so as to simplify the definition on x86. On x86, set_perf_counter_pending() is defined as a macro. Defining it as a static inline in arch/x86/include/asm/perf_counters.h causes compile failures because <asm/perf_counters.h> gets included early in <linux/sched.h>, and the definitions of set_tsk_thread_flag etc. are therefore not available in <asm/perf_counters.h>. (On powerpc this problem is avoided by defining set_perf_counter_pending etc. in <asm/hw_irq.h>.) Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-04-06 09:30:14 +02:00
Tim Blechmann	4e193bd4df	perf_counter: include missing header Impact: build fix In order to compile a kernel with performance counter patches, <asm/irq_regs.h> has to be included to provide the declaration of struct pt_regs *get_irq_regs(void); [ This bug was masked by unrelated x86 header file changes in the x86 tree, but occurs in the tip:perfcounters/core standalone tree. ] Signed-off-by: Tim Blechmann <tim@klingt.org> Orig-LKML-Reference: <20090314142925.49c29c17@thinkpad> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:13 +02:00
Peter Zijlstra	039fc91e06	perf_counter: fix hrtimer sampling Impact: fix deadlock with perfstat Fix for the perfstat fubar.. We cannot unconditionally call hrtimer_cancel() without ever having done hrtimer_init() on the thing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Orig-LKML-Reference: <1236959027.22447.149.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:46 +02:00
Peter Zijlstra	592903cdcb	perf_counter: add an event_list I noticed that the counter_list only includes top-level counters, thus perf_swcounter_event() will miss sw-counters in groups. Since perf_swcounter_event() also wants an RCU safe list, create a new event_list that includes all counters and uses RCU list ops and use call_rcu to free the counter structure. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:43 +02:00
Peter Zijlstra	d6d020e995	perf_counter: hrtimer based sampling for software time events Use hrtimers to profile timer based sampling for the software time counters. This allows platforms without hardware counter support to still perform sample based profiling. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:41 +02:00
Peter Zijlstra	ac17dc8e58	perf_counter: provide major/minor page fault software events Provide separate sw counters for major and minor page faults. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:40 +02:00
Peter Zijlstra	7dd1fcc258	perf_counter: provide pagefault software events We use the generic software counter infrastructure to provide page fault events. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:37 +02:00
Peter Zijlstra	15dbf27cc1	perf_counter: software counter event infrastructure Provide generic software counter infrastructure that supports software events. This will be used to allow sample based profiling based on software events such as pagefaults. The current infrastructure can only provide a count of such events, no place information. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:36 +02:00
Peter Zijlstra	755642322a	perf_counter: use list_move_tail() Instead of del/add use a move list-op. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:29:31 +02:00
Paul Mackerras	2743a5b0fa	perfcounters: provide expansion room in the ABI Impact: ABI change This expands several fields in the perf_counter_hw_event struct and adds a "flags" argument to the perf_counter_open system call, in order that features can be added in future without ABI changes. In particular the record_type field is expanded to 64 bits, and the space for flag bits has been expanded from 32 to 64 bits. This also adds some new fields: * read_format (64 bits) is intended to provide a way to specify what userspace wants to get back when it does a read() on a simple (non-interrupting) counter; * exclude_idle (1 bit) provides a way for userspace to ask that events that occur when the cpu is idle be excluded; * extra_config_len will provide a way for userspace to supply an arbitrary amount of extra machine-specific PMU configuration data immediately following the perf_counter_hw_event struct, to allow sophisticated users to program things such as instruction matching CAMs and address range registers; * __reserved_3 and __reserved_4 provide space for future expansion. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-03-04 20:36:51 +11:00
Paul Mackerras	f3dfd2656d	perfcounters: fix a few minor cleanliness issues This fixes three issues noticed by Arnd Bergmann: - Add #ifdef __KERNEL__ and move some things around in perf_counter.h to make sure only the bits that userspace needs are exported to userspace. - Use __u64, __s64, __u32 types in the structs exported to userspace rather than u64, s64, u32. - Make the sys_perf_counter_open syscall available to the SPUs on Cell platforms. And one issue that I noticed in looking at the code again: - Wrap the perf_counter_open syscall with SYSCALL_DEFINE4 so we get the proper handling of int arguments on ppc64 (and some other 64-bit architectures). Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-02-26 22:43:46 +11:00
Paul Mackerras	c07c99b672	perfcounters: make context switch and migration software counters work again Jaswinder Singh Rajput reported that commit `23a185ca8a` caused the context switch and migration software counters to report zero always. With that commit, the software counters only count events that occur between sched-in and sched-out for a task. This is necessary for the counter enable/disable prctls and ioctls to work. However, the context switch and migration counts are incremented after sched-out for one task and before sched-in for the next. Since the increment doesn't occur while a task is scheduled in (as far as the software counters are concerned) it doesn't count towards any counter. Thus the context switch and migration counters need to count events that occur at any time, provided the counter is enabled, not just those that occur while the task is scheduled in (from the perf_counter subsystem's point of view). The problem though is that the software counter code can't tell the difference between being enabled and being scheduled in, and between being disabled and being scheduled out, since we use the one pair of enable/disable entry points for both. That is, the high-level disable operation simply arranges for the counter to not be scheduled in any more, and the high-level enable operation arranges for it to be scheduled in again. One way to solve this would be to have sched_in/out operations in the hw_perf_counter_ops struct as well as enable/disable. However, this takes a simpler approach: it adds a 'prev_state' field to the perf_counter struct that allows a counter's enable method to know whether the counter was previously disabled or just inactive (scheduled out), and therefore whether the enable method is being called as a result of a high-level enable or a schedule-in operation. This then allows the context switch, migration and page fault counters to reset their hw.prev_count value in their enable functions only if they are called as a result of a high-level enable operation. Although page faults would normally only occur while the counter is scheduled in, this changes the page fault counter code too in case there are ever circumstances where page faults get counted against a task while its counters are not scheduled in. Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-02-13 12:20:38 +01:00
Paul Mackerras	4bcf349a0f	perfcounters: fix refcounting bug, take 2 Only free child_counter if it has a parent; if it doesn't, then it has a file pointing to it and we'll free it in perf_release. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-02-11 14:08:44 +01:00
Mike Galbraith	5af759176c	perfcounters: fix use after free in perf_release() running... while true; do foo -d 1 -f 1 -c 100000 & sleep 1 kerneltop -d 1 -f 1 -e 1 -c 25000 -p `pidof foo` done while true; do killall foo; killall kerneltop; sleep 2 done ...in two shells with SLUB_DEBUG enabled produces flood of: BUG task_struct: Poison overwritten. Fix the use-after-free bug in perf_release(). Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-02-11 11:30:10 +01:00
Paul Mackerras	0475f9ea8e	perf_counters: allow users to count user, kernel and/or hypervisor events Impact: new perf_counter feature This extends the perf_counter_hw_event struct with bits that specify that events in user, kernel and/or hypervisor mode should not be counted (i.e. should be excluded), and adds code to program the PMU mode selection bits accordingly on x86 and powerpc. For software counters, we don't currently have the infrastructure to distinguish which mode an event occurs in, so we currently fail the counter initialization if the setting of the hw_event.exclude_* bits would require us to distinguish. Context switches and CPU migrations are currently considered to occur in kernel mode. On x86, this changes the previous policy that only root can count kernel events. Now non-root users can count kernel events or exclude them. Non-root users still can't use NMI events, though. On x86 we don't appear to have any way to control whether hypervisor events are counted or not, so hw_event.exclude_hv is ignored. On powerpc, the selection of whether to count events in user, kernel and/or hypervisor mode is PMU-wide, not per-counter, so this adds a check that the hw_event.exclude_* settings are the same as other events on the PMU. Counters being added to a group have to have the same settings as the other hardware counters in the group. Counters and groups can only be enabled in hw_perf_group_sched_in or power_perf_enable if they have the same settings as any other counters already on the PMU. If we are not running on a hypervisor, the exclude_hv setting is ignored (by forcing it to 0) since we can't ever get any hypervisor events. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-02-11 15:06:59 +11:00
Paul Mackerras	23a185ca8a	perf_counters: make software counters work as per-cpu counters Impact: kernel crash fix Yanmin Zhang reported that using a PERF_COUNT_TASK_CLOCK software counter as a per-cpu counter would reliably crash the system, because it calls __task_delta_exec with a null pointer. The page fault, context switch and cpu migration counters also won't function correctly as per-cpu counters since they reference the current task. This fixes the problem by redirecting the task_clock counter to the cpu_clock counter when used as a per-cpu counter, and by implementing per-cpu page fault, context switch and cpu migration counters. Along the way, this: - Initializes counter->ctx earlier, in perf_counter_alloc, so that sw_perf_counter_init can use it - Adds code to kernel/sched.c to count task migrations into each cpu, in rq->nr_migrations_in - Exports the per-cpu context switch and task migration counts via new functions added to kernel/sched.c - Makes sure that if sw_perf_counter_init fails, we don't try to initialize the counter as a hardware counter. Since the user has passed a negative, non-raw event type, they clearly don't intend for it to be interpreted as a hardware event. Reported-by: "Zhang Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-02-09 12:47:16 +01:00
Mike Galbraith	65d370862f	perfcounters: fix refcounting bug don't kfree in use counters. Running... while true; do perfstat -e 1 -c true; done ...on all cores for a while doesn't seem to be eating ram, and my oops is gone. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-29 14:25:23 +01:00
Paul Mackerras	d859e29fe3	perf_counter: Add counter enable/disable ioctls Impact: New perf_counter features This primarily adds a way for perf_counter users to enable and disable counters and groups. Enabling or disabling a counter or group also enables or disables all of the child counters that have been cloned from it to monitor children of the task monitored by the top-level counter. The userspace interface to enable/disable counters is via ioctl on the counter file descriptor. Along the way this extends the code that handles child counters to handle child counter groups properly. A group with multiple counters will be cloned to child tasks if and only if the group leader has the hw_event.inherit bit set - if it is set the whole group is cloned as a group in the child task. In order to be able to enable or disable all child counters of a given top-level counter, we need a way to find them all. Hence I have added a child_list field to struct perf_counter, which is the head of the list of children for a top-level counter, or the link in that list for a child counter. That list is protected by the perf_counter.mutex field. This also adds a mutex to the perf_counter_context struct. Previously the list of counters was protected just by the lock field in the context, which meant that perf_counter_init_task had to take that lock and then take whatever lock/mutex protects the top-level counter's child_list. But the counter enable/disable functions need to take that lock in order to traverse the list, then for each counter take the lock in that counter's context in order to change the counter's state safely, which will lead to a deadlock. To solve this, we now have both a mutex and a spinlock in the context, and taking either is sufficient to ensure the list of counters can't change - you have to take both before changing the list. Now perf_counter_init_task takes the mutex instead of the lock (which incidentally means that inherit_counter can use GFP_KERNEL instead of GFP_ATOMIC) and thus avoids the possible deadlock. Similarly the new enable/disable functions can take the mutex while traversing the list of child counters without incurring a possible deadlock when the counter manipulation code locks the context for a child counter. We also had an misfeature that the first counter added to a context would possibly not go on until the next sched-in, because we were using ctx->nr_active to detect if the context was running on a CPU. But nr_active is the number of active counters, and if that was zero (because the context didn't have any counters yet) it would look like the context wasn't running on a cpu and so the retry code in __perf_install_in_context wouldn't retry. So this adds an 'is_active' field that is set when the context is on a CPU, even if it has no counters. The is_active field is only used for task contexts, not for per-cpu contexts. If we enable a subsidiary counter in a group that is active on a CPU, and the arch code can't enable the counter, then we have to pull the whole group off the CPU. We do this with group_sched_out, which gets moved up in the file so it comes before all its callers. This also adds similar logic to __perf_install_in_context so that the "all on, or none" invariant of groups is preserved when adding a new counter to a group. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-17 18:10:22 +11:00
Paul Mackerras	3b6f9e5cb2	perf_counter: Add support for pinned and exclusive counter groups Impact: New perf_counter features A pinned counter group is one that the user wants to have on the CPU whenever possible, i.e. whenever the associated task is running, for a per-task group, or always for a per-cpu group. If the system cannot satisfy that, it puts the group into an error state where it is not scheduled any more and reads from it return EOF (i.e. 0 bytes read). The group can be released from error state and made readable again using prctl(PR_TASK_PERF_COUNTERS_ENABLE). When we have finer-grained enable/disable controls on counters we'll be able to reset the error state on individual groups. An exclusive group is one that the user wants to be the only group using the CPU performance monitor hardware whenever it is on. The counter group scheduler will not schedule an exclusive group if there are already other groups on the CPU and will not schedule other groups onto the CPU if there is an exclusive group scheduled (that statement does not apply to groups containing only software counters, which can always go on and which do not prevent an exclusive group from going on). With an exclusive group, we will be able to let users program PMU registers at a low level without the concern that those settings will perturb other measurements. Along the way this reorganizes things a little: - is_software_counter() is moved to perf_counter.h. - cpuctx->active_oncpu now records the number of hardware counters on the CPU, i.e. it now excludes software counters. Nothing was reading cpuctx->active_oncpu before, so this change is harmless. - A new cpuctx->exclusive field records whether we currently have an exclusive group on the CPU. - counter_sched_out moves higher up in perf_counter.c and gets called from __perf_counter_remove_from_context and __perf_counter_exit_task, where we used to have essentially the same code. - __perf_counter_sched_in now goes through the counter list twice, doing the pinned counters in the first loop and the non-pinned counters in the second loop, in order to give the pinned counters the best chance to be scheduled in. Note that only a group leader can be exclusive or pinned, and that attribute applies to the whole group. This avoids some awkwardness in some corner cases (e.g. where a group leader is closed and the other group members get added to the context list). If we want to relax that restriction later, we can, and it is easier to relax a restriction than to apply a new one. This doesn't yet handle the case where a pinned counter is inherited and goes into error state in the child - the error state is not propagated up to the parent when the child exits, and arguably it should. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-14 21:00:30 +11:00
Paul Mackerras	01d0287f06	powerpc/perf_counter: Make sure PMU gets enabled properly This makes sure that we call the platform-specific ppc_md.enable_pmcs function on each CPU before we try to use the PMU on that CPU. If the CPU goes off-line and then on-line, we need to do the enable_pmcs call again, so we use the hw_perf_counter_setup hook to ensure that. It gets called as each CPU comes online, but it isn't called on the CPU that is coming up, so this adds the CPU number as an argument to it (there were no non-empty instances of hw_perf_counter_setup before). This also arranges to set the pmcregs_in_use field of the lppaca (data structure shared with the hypervisor) on each CPU when we are using the PMU and clear it when we are not. This allows the hypervisor to optimize partition switches by not saving/restoring the PMU registers when we aren't using the PMU. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-14 13:44:19 +11:00
Paul Mackerras	dd0e6ba22e	perf_counter: Always schedule all software counters in Software counters aren't subject to the limitations imposed by the fixed number of hardware counter registers, so there is no reason not to enable them all in __perf_counter_sched_in. Previously we used to break out of the loop when we got to a group that wouldn't fit on the PMU; with this we continue through the list but only schedule in software counters (or groups containing only software counters) from there on. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-12 15:12:50 +11:00
Paul Mackerras	4eb96fcfe0	perf_counter: Add dummy perf_counter_print_debug function Impact: minimize requirements on architectures Currently, an architecture just enabling CONFIG_PERF_COUNTERS but not providing any extra functions will fail to build with perf_counter_print_debug being undefined, since we don't provide an empty dummy definition like we do with the hw_perf_* functions. This provides an empty dummy perf_counter_print_debug() to make it easier for architectures to turn on CONFIG_PERF_COUNTERS. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-09 17:24:34 +11:00
Paul Mackerras	3cbed429a9	perf_counter: Add optional hw_perf_group_sched_in arch function Impact: extend perf_counter infrastructure This adds an optional hw_perf_group_sched_in() arch function that enables a whole group of counters in one go. It returns 1 if it added the group successfully, 0 if it did nothing (and therefore the core needs to add the counters individually), or a negative number if an error occurred. It should add all the counters and enable any software counters in the group, or else add none of them and return an error. There are a couple of related changes/improvements in the group handling here: * As an optimization, group_sched_out() and group_sched_in() now check the state of the group leader, and do nothing if the leader is not active or disabled. * We now call hw_perf_save_disable/hw_perf_restore around the complete set of counter enable/disable calls in __perf_counter_sched_in/out, to give the arch code the opportunity to defer updating the hardware state until the hw_perf_restore call if it wants. * We no longer stop adding groups after we get to a group that has more than one counter. We will ultimately add an option for a group to be exclusive. The current code doesn't really implement exclusive groups anyway, since a group could end up going on with other counters that get added before it. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-09 16:43:42 +11:00
Paul Mackerras	9abf8a08bc	perf_counter: Fix the cpu_clock software counter Impact: bug fix Currently if you do (e.g.) timec -e -1 ls, it will report 0 for the value of the cpu_clock counter. The reason is that the core assumes that a counter's count field is up-to-date when the counter is inactive, and doesn't call the counter's read function. However, the cpu_clock counter code only updates the count in the read function. This fixes it by making both the read and disable functions update the count. It also makes the counter ignore time passing while the counter is disabled, by making the enable function update the hw.prev_count field. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-09 16:26:43 +11:00
Paul Mackerras	ff6f05416e	perf_counter: Fix return value from dummy hw_perf_counter_init Impact: fix oops-causing bug Currently, if you try to use perf_counters on an architecture that has no hardware support, and you select an event that doesn't map to any of the defined software counters, you get an oops rather than an error. This is because the dummy hw_perf_counter_init returns ERR_PTR(-EINVAL) but the caller (perf_counter_alloc) only tests for NULL. This makes the dummy hw_perf_counter_init return NULL instead. Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-01-09 16:19:25 +11:00
Yinghai Lu	01ea1ccaa2	perf_counter: more barrier in blank weak function Impact: fix panic possible panic Some versions of GCC inline the weak global function if it's empty. Add a barrier() to work it around. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-27 11:58:48 +01:00
Ingo Molnar	235c7fc7c5	perfcounters: generalize the counter scheduler Impact: clean up and refactor code refactor the counter scheduler: separate out in/out functions and introduce a counter-rotation function as well. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:23 +01:00
Ingo Molnar	8fe91e61cd	perfcounters: remove ->nr_inherited Impact: remove dead code nr_inherited was not maintained correctly (not decremented) - and also not used - remove it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:22 +01:00
Ingo Molnar	95cdd2e785	perfcounters: enable lowlevel pmc code to schedule counters Allow lowlevel ->enable() op to return an error if a counter can not be added. This can be used to handle counter constraints. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:19 +01:00
Ingo Molnar	aa9c4c0f96	perfcounters: fix task clock counter Impact: fix per task clock counter precision Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:14 +01:00
Ingo Molnar	7671581f16	perfcounters: hw ops rename Impact: rename field names Shorten them. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:13 +01:00
Ingo Molnar	7995888fcb	perfcounters: tweak group scheduling Impact: schedule in groups atomically If there are multiple groups in a task, make sure they are scheduled in and out atomically. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:09 +01:00
Ingo Molnar	8fb9331391	perfcounters: remove warnings Impact: remove debug checks Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-23 12:45:08 +01:00
Ingo Molnar	a86ed50859	perfcounters: use hw_event.disable flag Impact: implement default-off counters Make sure that counters that are created with counter.hw_event.disabled=1, get created in disabled state. They can be enabled via: prctl(PR_TASK_PERF_COUNTERS_ENABLE); Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-17 01:02:21 +01:00
Ingo Molnar	0cc0c027d4	perfcounters: release CPU context when exiting task counters If counters are exiting via do_exit() not via filp close, then the CPU context needs to be released - otherwise future percpu counter creations might fail. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 23:25:02 +01:00
Ingo Molnar	088e2852c8	perfcounters, x86: fix sw counters on non-PMC CPUs Make perf_max_counters default to at least 1 - this allows the sw counters to be used. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:31:29 +01:00
Ingo Molnar	e06c61a879	perfcounters: add nr-of-faults counter Impact: add new feature, new sw counter Add a counter that counts the number of pagefaults a task is experiencing. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:31:27 +01:00
Ingo Molnar	6c594c21fc	perfcounters: add task migrations counter Impact: add new feature, new sw counter Add a counter that counts the number of cross-CPU migrations a task is suffering. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:31:26 +01:00
Ingo Molnar	5d6a27d8a0	perfcounters: add context switch counter Impact: add new feature, new sw counter Add a counter that counts the number of context-switches a task is doing. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:31:23 +01:00
Ingo Molnar	8cb391e878	perfcounters: fix task clock counter Impact: bugfix Update the task clock counter to the new math. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:30:50 +01:00
Ingo Molnar	9b51f66dcb	perfcounters: implement "counter inheritance" Impact: implement new performance feature Counter inheritance can be used to run performance counters in a workload, transparently - and pipe back the counter results to the parent counter. Inheritance for performance counters works the following way: when creating a counter it can be marked with the .inherit=1 flag. Such counters are then 'inherited' by all child tasks (be they fork()-ed or clone()-ed). These counters get inherited through exec() boundaries as well (except through setuid boundaries). The counter values get added back to the parent counter(s) when the child task(s) exit - much like stime/utime statistics are gathered. So inherited counters are ideal to gather summary statistics about an application's behavior via shell commands, without having to modify that application. The timec.c command utilizes counter inheritance: http://redhat.com/~mingo/perfcounters/timec.c Sample output: $ ./timec -e 1 -e 3 -e 5 ls -lR /usr/include/ >/dev/null Performance counter stats for 'ls': 163516953 instructions 2295 cache-misses 2855182 branch-misses Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:30:49 +01:00
Ingo Molnar	ee06094f82	perfcounters: restructure x86 counter math Impact: restructure code Change counter math from absolute values to clear delta logic. We try to extract elapsed deltas from the raw hw counter - and put that into the generic counter. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-14 20:30:48 +01:00
Ingo Molnar	6a930700c8	perf counters: clean up state transitions Impact: cleanup Introduce a proper enum for the 3 states of a counter: PERF_COUNTER_STATE_OFF = -1 PERF_COUNTER_STATE_INACTIVE = 0 PERF_COUNTER_STATE_ACTIVE = 1 and rename counter->active to counter->state and propagate the changes everywhere. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:56 +01:00
Ingo Molnar	1d1c7ddbfa	perf counters: add prctl interface to disable/enable counters Add a way for self-monitoring tasks to disable/enable counters summarily, via a prctl: PR_TASK_PERF_COUNTERS_DISABLE 31 PR_TASK_PERF_COUNTERS_ENABLE 32 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:55 +01:00
Ingo Molnar	bae43c9945	perf counters: implement PERF_COUNT_TASK_CLOCK Impact: add new perf-counter type The 'task clock' counter counts the amount of time a task is executing, in nanoseconds. It stops ticking when a task is scheduled out either due to it blocking, sleeping or it being preempted. This counter type is a Linux kernel based abstraction, it is available even if the hardware does not support native hardware performance counters. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:54 +01:00
Ingo Molnar	01b2838c42	perf counters: consolidate hw_perf save/restore APIs Impact: cleanup Rename them to better match up the usual IRQ disable/enable APIs: hw_perf_disable_all() => hw_perf_save_disable() hw_perf_restore_ctrl() => hw_perf_restore() Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:53 +01:00
Ingo Molnar	5c92d12411	perf counters: implement PERF_COUNT_CPU_CLOCK Impact: add new perf-counter type The 'CPU clock' counter counts the amount of CPU clock time that is elapsing, in nanoseconds. (regardless of how much of it the task is spending on a CPU executing) This counter type is a Linux kernel based abstraction, it is available even if the hardware does not support native hardware performance counters. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:52 +01:00
Ingo Molnar	621a01eac8	perf counters: hw driver API Impact: restructure code, introduce hw_ops driver abstraction Introduce this abstraction to handle counter details: struct hw_perf_counter_ops { void (hw_perf_counter_enable) (struct perf_counter counter); void (hw_perf_counter_disable) (struct perf_counter counter); void (hw_perf_counter_read) (struct perf_counter counter); }; This will be useful to support assymetric hw details, and it will also be useful to implement "software counters". (Counters that count kernel managed sw events such as pagefaults, context-switches, wall-clock time or task-local time.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:51 +01:00
Ingo Molnar	ccff286d85	perf counters: group counter, fixes Impact: bugfix Check that a group does not span outside the context of a CPU or a task. Also, do not allow deep recursive hierarchies. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:50 +01:00
Ingo Molnar	04289bb989	perf counters: add support for group counters Impact: add group counters This patch adds the "counter groups" abstraction. Groups of counters behave much like normal 'single' counters, with a few semantic and behavioral extensions on top of that. A counter group is created by creating a new counter with the open() syscall's group-leader group_fd file descriptor parameter pointing to another, already existing counter. Groups of counters are scheduled in and out in one atomic group, and they are also roundrobin-scheduled atomically. Counters that are member of a group can also record events with an (atomic) extended timestamp that extends to all members of the group, if the record type is set to PERF_RECORD_GROUP. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:49 +01:00
Ingo Molnar	9f66a3810f	perf counters: restructure the API Impact: clean up new API Thorough cleanup of the new perf counters API, we now get clean separation of the various concepts: - introduce perf_counter_hw_event to separate out the event source details - move special type flags into separate attributes: PERF_COUNT_NMI, PERF_COUNT_RAW - extend the type to u64 and reserve it fully to the architecture in the raw type case. And make use of all these changes in the core and x86 perfcounters code. Also change the syscall signature to: asmlinkage int sys_perf_counter_open( struct perf_counter_hw_event *hw_event_uptr __user, pid_t pid, int cpu, int group_fd); ( Note that group_fd is unused for now - it's reserved for the counter groups abstraction. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:48 +01:00
Thomas Gleixner	dfa7c899b4	perf counters: expand use of counter->event Impact: change syscall, cleanup Make use of the new perf_counters event type. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:47 +01:00
Thomas Gleixner	eab656ae04	perf counters: clean up 'raw' type API Impact: cleanup Introduce a separate hw_event type. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-11 15:45:46 +01:00
Thomas Gleixner	0793a61d4d	performance counters: core code Implement the core kernel bits of Performance Counters subsystem. The Linux Performance Counter subsystem provides an abstraction of performance counter hardware capabilities. It provides per task and per CPU counters, and it provides event capabilities on top of those. Performance counters are accessed via special file descriptors. There's one file descriptor per virtual counter used. The special file descriptor is opened via the perf_counter_open() system call: int perf_counter_open(u32 hw_event_type, u32 hw_event_period, u32 record_type, pid_t pid, int cpu); The syscall returns the new fd. The fd can be used via the normal VFS system calls: read() can be used to read the counter, fcntl() can be used to set the blocking mode, etc. Multiple counters can be kept open at a time, and the counters can be poll()ed. See more details in Documentation/perf-counters.txt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-12-08 15:47:03 +01:00

1 2 3 4 5

205 Commits