Pull cpuset changes from Tejun Heo:
"cpuset has always been rather odd about its configurations - a cgroup
right after creation didn't allow any task executions before
configuration, changing configuration in the parent modifies the
descendants irreversibly and so on. These behaviors are inherently
nasty and almost hostile against sharing the hierarchy with other
controllers making it very difficult to use in unified hierarchy.
Li is currently in the process of updating the behaviors for
__DEVEL__sane_behavior which is the bulk of changes in this pull
request. It isn't complete yet and the behaviors will change further
but all changes are gated behind sane_behavior. In the process, the
rather hairy work-item punting which was used to work around the
limitations of cgroup descendant iterator was simplified."
* 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: rename @cont to @cgrp
cpuset: fix to migrate mm correctly in a corner case
cpuset: allow to move tasks to empty cpusets
cpuset: allow to keep tasks in empty cpusets
cpuset: introduce effective_{cpumask|nodemask}_cpuset()
cpuset: record old_mems_allowed in struct cpuset
cpuset: remove async hotplug propagation work
cpuset: let hotplug propagation work wait for task attaching
cpuset: re-structure update_cpumask() a bit
cpuset: remove cpuset_test_cpumask()
cpuset: remove unnecessary variable in cpuset_attach()
cpuset: cleanup guarantee_online_{cpus|mems}()
cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
Pull cgroup changes from Tejun Heo:
"This pull request contains the following changes.
- cgroup_subsys_state (css) reference counting has been converted to
percpu-ref. css is what each resource controller embeds into its
own control structure and perform reference count against. It may
be used in hot paths of various subsystems and is similar to module
refcnt in that aspect. For example, block-cgroup's css refcnting
was showing up a lot in Mikulaus's device-mapper scalability work
and this should alleviate it.
- cgroup subtree iterator has been updated so that RCU read lock can
be released after grabbing reference. This allows simplifying its
users which requires blocking which used to build iteration list
under RCU read lock and then traverse it outside. This pull
request contains simplification of cgroup core and device-cgroup.
A separate pull request will update cpuset.
- Fixes for various bugs including corner race conditions and RCU
usage bugs.
- A lot of cleanups and some prepartory work for the planned unified
hierarchy support."
* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (48 commits)
cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy
cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options
cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()
cgroup: always use RCU accessors for protected accesses
cgroup: fix RCU accesses around task->cgroups
cgroup: fix RCU accesses to task->cgroups
cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()
cgroup: fix cgroupfs_root early destruction path
cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy
cgroup: implement for_each_[builtin_]subsys()
cgroup: move init_css_set initialization inside cgroup_mutex
cgroup: s/for_each_subsys()/for_each_root_subsys()/
cgroup: clean up find_css_set() and friends
cgroup: remove cgroup->actual_subsys_mask
cgroup: prefix global variables with "cgroup_"
cgroup: convert CFTYPE_* flags to enums
cgroup: rename cont to cgrp
cgroup: clean up cgroup_serial_nr_cursor
cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()
cgroup: make serial_nr_cursor available throughout cgroup.c
...
Pull workqueue changes from Tejun Heo:
"Surprisingly, Lai and I didn't break too many things implementing
custom pools and stuff last time around and there aren't any follow-up
changes necessary at this point.
The only change in this pull request is Viresh's patches to make some
per-cpu workqueues to behave as unbound workqueues dependent on a boot
param whose default can be configured via a config option. This leads
to higher processing overhead / lower bandwidth as more work items are
bounced across CPUs; however, it can lead to noticeable powersave in
certain configurations - ~10% w/ idlish constant workload on a
big.LITTLE configuration according to Viresh.
This is because per-cpu workqueues interfere with how the scheduler
perceives whether or not each CPU is idle by forcing pinned tasks on
them, which makes the scheduler's power-aware scheduling decisions
less effective.
Its effectiveness is likely less pronounced on homogenous
configurations and this type of optimization can probably be made
automatic; however, the changes are pretty minimal and the affected
workqueues are clearly marked, so it's an easy gain for some
configurations for the time being with pretty unintrusive changes."
* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
fbcon: queue work on power efficient wq
block: queue work on power efficient wq
PHYLIB: queue work on system_power_efficient_wq
workqueue: Add system wide power_efficient workqueues
workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
percpu memory on large machines:
Now we have a new state MODULE_STATE_UNFORMED, we can insert the
module into the list (and thus guarantee its uniqueness) before we
allocate the per-cpu region.
In my defence, it didn't actually say the patch did this. Just that
we "can".
This patch actually *does* it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Jim Hull <jim.hull@hp.com>
Cc: stable@kernel.org # 3.8
I have patches that will use tracing_open_generic_tr/tc() in other
files, but as they are not ready to be merged yet, and Fengguang Wu's
sparse scripts pointed out that these functions were not declared
anywhere, I'll make them static for now.
When these functions are required to be used elsewhere, I'll remove
the static then.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I have patches that will use tracer_tracing_on/off/is_on() in other
files, but as they are not ready to be merged yet, and Fengguang Wu's
sparse scripts pointed out that these functions were not declared
anywhere, I'll make them static for now.
When these functions are required to be used elsewhere, I'll remove
the static then.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While analyzing the code, I discovered that there's a potential race between
deleting a trace instance and setting events. There are a few races that can
occur if events are being traced as the buffer is being deleted. Mostly the
problem comes with freeing the descriptor used by the trace event callback.
To prevent problems like this, the events are disabled before the buffer is
deleted. The problem with the current solution is that the event_mutex is let
go between disabling the events and freeing the files, which means that the events
could be enabled again while the freeing takes place.
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull x86 RAS update from Ingo Molnar:
"The changes in this tree are:
- ACPI APEI (ACPI Platform Error Interface) improvements, by Chen
Gong
- misc MCE fixes/cleanups"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Update MCE severity condition check
mce: acpi/apei: Add comments to clarify usage of the various bitfields in the MCA subsystem
ACPI/APEI: Update einj documentation for param1/param2
ACPI/APEI: Add parameter check before error injection
ACPI, APEI, EINJ: Fix error return code in einj_init()
x86, mce: Fix "braodcast" typo
Pull scheduler updates from Ingo Molnar:
"The main changes:
- load-calculation cleanups and improvements, by Alex Shi
- various nohz related tidying up of statisics, by Frederic
Weisbecker
- factor out /proc functions to kernel/sched/proc.c, by Paul
Gortmaker
- simplify the RT policy scheduler, by Kirill Tkhai
- various fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
sched/debug: Remove CONFIG_FAIR_GROUP_SCHED mask
sched/debug: Fix formatting of /proc/<PID>/sched
sched: Fix typo in struct sched_avg member description
sched/fair: Fix typo describing flags in enqueue_entity
sched/debug: Add load-tracking statistics to task
sched: Change get_rq_runnable_load() to static and inline
sched/tg: Remove tg.load_weight
sched/cfs_rq: Change atomic64_t removed_load to atomic_long_t
sched/tg: Use 'unsigned long' for load variable in task group
sched: Change cfs_rq load avg to unsigned long
sched: Consider runnable load average in move_tasks()
sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task
sched: Update cpu load after task_tick
sched: Fix sleep time double accounting in enqueue entity
sched: Set an initial value of runnable avg for new forked task
sched: Move a few runnable tg variables into CONFIG_SMP
Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
sched: Don't mix use of typedef ctl_table and struct ctl_table
sched: Remove WARN_ON(!sd) from init_sched_groups_power()
sched: Fix memory leakage in build_sched_groups()
...
Pull perf updates from Ingo Molnar:
"Kernel improvements:
- watchdog driver improvements by Li Zefan
- Power7 CPI stack events related improvements by Sukadev Bhattiprolu
- event multiplexing via hrtimers and other improvements by Stephane
Eranian
- kernel stack use optimization by Andrew Hunter
- AMD IOMMU uncore PMU support by Suravee Suthikulpanit
- NMI handling rate-limits by Dave Hansen
- various hw_breakpoint fixes by Oleg Nesterov
- hw_breakpoint overflow period sampling and related signal handling
fixes by Jiri Olsa
- Intel Haswell PMU support by Andi Kleen
Tooling improvements:
- Reset SIGTERM handler in workload child process, fix from David
Ahern.
- Makefile reorganization, prep work for Kconfig patches, from Jiri
Olsa.
- Add automated make test suite, from Jiri Olsa.
- Add --percent-limit option to 'top' and 'report', from Namhyung
Kim.
- Sorting improvements, from Namhyung Kim.
- Expand definition of sysfs format attribute, from Michael Ellerman.
Tooling fixes:
- 'perf tests' fixes from Jiri Olsa.
- Make Power7 CPI stack events available in sysfs, from Sukadev
Bhattiprolu.
- Handle death by SIGTERM in 'perf record', fix from David Ahern.
- Fix printing of perf_event_paranoid message, from David Ahern.
- Handle realloc failures in 'perf kvm', from David Ahern.
- Fix divide by 0 in variance, from David Ahern.
- Save parent pid in thread struct, from David Ahern.
- Handle JITed code in shared memory, from Andi Kleen.
- Fixes for 'perf diff', from Jiri Olsa.
- Remove some unused struct members, from Jiri Olsa.
- Add missing liblk.a dependency for python/perf.so, fix from Jiri
Olsa.
- Respect CROSS_COMPILE in liblk.a, from Rabin Vincent.
- No need to do locking when adding hists in perf report, only 'top'
needs that, from Namhyung Kim.
- Fix alignment of symbol column in in the hists browser (top,
report) when -v is given, from NAmhyung Kim.
- Fix 'perf top' -E option behavior, from Namhyung Kim.
- Fix bug in isupper() and islower(), from Sukadev Bhattiprolu.
- Fix compile errors in bp_signal 'perf test', from Sukadev
Bhattiprolu.
... and more things"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (102 commits)
perf/x86: Disable PEBS-LL in intel_pmu_pebs_disable()
perf/x86: Fix shared register mutual exclusion enforcement
perf/x86/intel: Support full width counting
x86: Add NMI duration tracepoints
perf: Drop sample rate when sampling is too slow
x86: Warn when NMI handlers take large amounts of time
hw_breakpoint: Introduce "struct bp_cpuinfo"
hw_breakpoint: Simplify *register_wide_hw_breakpoint()
hw_breakpoint: Introduce cpumask_of_bp()
hw_breakpoint: Simplify the "weight" usage in toggle_bp_slot() paths
hw_breakpoint: Simplify list/idx mess in toggle_bp_slot() paths
perf/x86/intel: Add mem-loads/stores support for Haswell
perf/x86/intel: Support Haswell/v4 LBR format
perf/x86/intel: Move NMI clearing to end of PMI handler
perf/x86/intel: Add Haswell PEBS support
perf/x86/intel: Add simple Haswell PMU support
perf/x86/intel: Add Haswell PEBS record support
perf/x86/intel: Fix sparse warning
perf/x86/amd: AMD IOMMU Performance Counter PERF uncore PMU implementation
perf/x86/amd: Add IOMMU Performance Counter resource management
...
Pull core irq changes from Ingo Molnar:
"The main changes:
- generic-irqchip driver additions, cleanups and fixes
- 3 new irqchip drivers: ARMv7-M NVIC, TB10x and Marvell Orion SoCs
- irq_get_trigger_type() simplification and cross-arch cleanup
- various cleanups, simplifications
- documentation updates"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
softirq: Use _RET_IP_
genirq: Add the generic chip to the genirq docbook
genirq: generic-chip: Export some irq_gc_ functions
genirq: Fix can_request_irq() for IRQs without an action
irqchip: exynos-combiner: Staticize combiner_init
irqchip: Add support for ARMv7-M NVIC
irqchip: Add TB10x interrupt controller driver
irqdomain: Use irq_get_trigger_type() to get IRQ flags
MIPS: octeon: Use irq_get_trigger_type() to get IRQ flags
arm: orion: Use irq_get_trigger_type() to get IRQ flags
mfd: stmpe: use irq_get_trigger_type() to get IRQ flags
mfd: twl4030-irq: Use irq_get_trigger_type() to get IRQ flags
gpio: mvebu: Use irq_get_trigger_type() to get IRQ flags
genirq: Add irq_get_trigger_type() to get IRQ flags
genirq: Irqchip: document gcflags arg of irq_alloc_domain_generic_chips
genirq: Set irq thread to RT priority on creation
irqchip: Add support for Marvell Orion SoCs
genirq: Add kerneldoc for irq_disable.
genirq: irqchip: Add mask to block out invalid irqs
genirq: Generic chip: Add linear irq domain support
...
Pull WW mutex support from Ingo Molnar:
"This tree adds support for wound/wait style locks, which the graphics
guys would like to make use of in the TTM graphics subsystem.
Wound/wait mutexes are used when other multiple lock acquisitions of a
similar type can be done in an arbitrary order. The deadlock handling
used here is called wait/wound in the RDBMS literature: The older
tasks waits until it can acquire the contended lock. The younger
tasks needs to back off and drop all the locks it is currently
holding, ie the younger task is wounded.
See this LWN.net description of W/W mutexes:
https://lwn.net/Articles/548909/
The comments there outline specific usecases for this facility (which
have already been implemented for the DRM tree).
Also see Documentation/ww-mutex-design.txt for more details"
* 'core-mutexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking-selftests: Handle unexpected failures more strictly
mutex: Add more w/w tests to test EDEADLK path handling
mutex: Add more tests to lib/locking-selftest.c
mutex: Add w/w tests to lib/locking-selftest.c
mutex: Add w/w mutex slowpath debugging
mutex: Add support for wound/wait style locks
arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not
Pull locking changes from Ingo Molnar:
"Four miscellanous standalone fixes for futexes, rtmutexes and
Kconfig.locks."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Use freezable blocking call
futex: Take hugepages into account when generating futex_key
rtmutex: Document rt_mutex_adjust_prio_chain()
locking: Fix copy/paste errors of "ARCH_INLINE_*_UNLOCK_BH"
Commit a695cb5816 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do echo 1 > foo/events/sched/sched_switch 2> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the write to the event
is opened, but before the enabling happens.
The solution is to make sure the trace_array is available before succeeding in
opening for write, and incerment the ref counter while opened.
Now the instance can be deleted when the events are writing to the buffer,
but the deletion of the instance will disable all events before the instance
is actually deleted.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Here's the big driver core merge for 3.11-rc1
Lots of little things, and larger firmware subsystem updates, all
described in the shortlog. Nice thing here is that we finally get rid
of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
been always on for a number of kernel releases, now it's just removed.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlHRsGMACgkQMUfUDdst+ylIIACfW8lLxOPVK+iYG699TWEBAkp0
LFEAnjlpAMJ1JnoZCuWDZObNCev93zGB
=020+
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here's the big driver core merge for 3.11-rc1
Lots of little things, and larger firmware subsystem updates, all
described in the shortlog. Nice thing here is that we finally get rid
of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
been always on for a number of kernel releases, now it's just
removed)"
* tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
driver core: device.h: fix doc compilation warnings
firmware loader: fix another compile warning with PM_SLEEP unset
build some drivers only when compile-testing
firmware loader: fix compile warning with PM_SLEEP set
kobject: sanitize argument for format string
sysfs_notify is only possible on file attributes
firmware loader: simplify holding module for request_firmware
firmware loader: don't export cache_firmware and uncache_firmware
drivers/base: Use attribute groups to create sysfs memory files
firmware loader: fix compile warning
firmware loader: fix build failure with !CONFIG_FW_LOADER_USER_HELPER
Documentation: Updated broken link in HOWTO
Finally eradicate CONFIG_HOTPLUG
driver core: firmware loader: kill FW_ACTION_NOHOTPLUG requests before suspend
driver core: firmware loader: don't cache FW_ACTION_NOHOTPLUG firmware
Documentation: Tidy up some drivers/base/core.c kerneldoc content.
platform_device: use a macro instead of platform_driver_register
firmware: move EXPORT_SYMBOL annotations
firmware: Avoid deadlock of usermodehelper lock at shutdown
dell_rbu: Select CONFIG_FW_LOADER_USER_HELPER explicitly
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIVAwUAUdLdUxOxKuMESys7AQK1kQ//W7fgFXCG+5XVk4ECHGN5tqRn4tU69DY0
9nYU2/y1wbqV5cTO36XTcFPQK1qbW2ZdyvEZ2CF8OfwtQpLmcALGtpBIgJwYs+4H
DMkgO06zdk4caxc0C4JBIGs+MDeLNk2SQObqblGl1BAQKQ5cqsCLsIZ/rxln999m
ufuobfns1YvuHkzMtswUDmm3zWMpwqqPAbbl+fTwPU683a/AleckG2ACyFvKZAxA
OyI8kJR4e33a3/BGo/5OFb3qI1+Z25EOWdvdnM+r4hdKJZF9ZySlyc640GZHAO2J
wKj5lYp1nBpyNPvYvly174s2MxPju1CRHb7gxcV4LX3vtEY4/MCg7m6P46EUfC6R
C3V7PMMCjZXEQ01MKEmGig47EJKIiecCQUZupJnP7HFKPzeJR9mQZFd68WqzswAM
w9hcCw9hQ9y/kTDVrTVCHs0Q9iTxShfrJyfRJnQ1VcoT+1dieruTa9am9OBKiEw6
CQrPjq9RZZfsZHYr6RlGZHGJyzjrTzrf6EhxwmgaCxWycpvCuV7z76YgAVZI7V4r
qnJmH8dXWdoSA7nZ6sgsb5TRCLT9wu1nNId0DMpAGB1cDGga/55AZtqxdoJLnlkj
y/4wQavIrkfHHuS8c3gzVXPtYmM19CHgcKRFydXD0uGobzfxwYKTKMH+Gviu1NnH
/pGNNY2vVGI=
=Wjhu
-----END PGP SIGNATURE-----
Merge tag 'fscache-20130702' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull FS-Cache updates from David Howells:
"This contains a number of fixes for various FS-Cache issues plus some
cleanups. The commits are, in order:
1) Provide a system wait_on_atomic_t() and wake_up_atomic_t() sharing
the bit-wait table (enhancement for #8).
2) Don't put spin_lock() in a while-condition as spin_lock() may have
a do {} while(0) wrapper (cleanup).
3) Symbolically name i_mutex lock classes rather than using numbers
in CacheFiles (cleanup).
4) Don't sleep in page release if __GFP_FS is not set (deadlock vs
ext4).
5) Uninline fscache_object_init() (cleanup for #7).
6) Wrap checks on object state (cleanup for #7).
7) Simplify the object state machine by separating work states from
wait states.
8) Simplify cookie retention by objects (NULL pointer deref fix).
9) Remove unused list_to_page() macro (cleanup).
10) Make the remaining-pages counter in the retrieval op atomic
(assertion failure fix).
11) Don't use spin_is_locked() in assertions (assertion failure fix)"
* tag 'fscache-20130702' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
FS-Cache: Don't use spin_is_locked() in assertions
FS-Cache: The retrieval remaining-pages counter needs to be atomic_t
cachefiles: remove unused macro list_to_page()
FS-Cache: Simplify cookie retention for fscache_objects, fixing oops
FS-Cache: Fix object state machine to have separate work and wait states
FS-Cache: Wrap checks on object state
FS-Cache: Uninline fscache_object_init()
FS-Cache: Don't sleep in page release if __GFP_FS is not set
CacheFiles: name i_mutex lock class explicitly
fs/fscache: remove spin_lock() from the condition in while()
Add wait_on_atomic_t() and wake_up_atomic_t()
When a trace file is opened that may access a trace array, it must
increment its ref count to prevent it from being deleted.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a695cb5816 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do cat foo/trace &> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the reader starts
to open the file but before it grabs the trace_types_mutex.
The solution is to validate the trace array before using it. If the trace
array does not exist in the list of trace arrays, then it returns -ENODEV.
There's a possibility that a trace_array could be deleted and a new one
created and the open would open its file instead. But that is very minor as
it will just return the data of the new trace array, it may confuse the user
but it will not crash the system. As this can only be done by root anyway,
the race will only occur if root is deleting what its trying to read at
the same time.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The recent implementation of a generic dummy timer resulted in a
different registration order of per cpu local timers which made the
broadcast control logic go belly up.
If the dummy timer is the first clock event device which is registered
for a CPU, then it is installed, the broadcast timer is initialized
and the CPU is marked as broadcast target.
If a real clock event device is installed after that, we can fail to
take the CPU out of the broadcast mask. In the worst case we end up
with two periodic timer events firing for the same CPU. One from the
per cpu hardware device and one from the broadcast.
Now the problem is that we have no way to distinguish whether the
system is in a state which makes broadcasting necessary or the
broadcast bit was set due to the nonfunctional dummy timer
installment.
To solve this we need to keep track of the system state seperately and
provide a more detailed decision logic whether we keep the CPU in
broadcast mode or not.
The old decision logic only clears the broadcast mode, if the newly
installed clock event device is not affected by power states.
The new logic clears the broadcast mode if one of the following is
true:
- The new device is not affected by power states.
- The system is not in a power state affected mode
- The system has switched to oneshot mode. The oneshot broadcast is
controlled from the deep idle state. The CPU is not in idle at
this point, so it's safe to remove it from the mask.
If we clear the broadcast bit for the CPU when a new device is
installed, we also shutdown the broadcast device when this was the
last CPU in the broadcast mask.
If the broadcast bit is kept, then we leave the new device in shutdown
state and rely on the broadcast to deliver the timer interrupts via
the broadcast ipis.
Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>,
Cc: Mark Rutland <mark.rutland@arm.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When the system switches from periodic to oneshot mode, the broadcast
logic causes a possibility that a CPU which has not yet switched to
oneshot mode puts its own clock event device into oneshot mode without
updating the state and the timer handler.
CPU0 CPU1
per cpu tickdev is in periodic mode
and switched to broadcast
Switch to oneshot mode
tick_broadcast_switch_to_oneshot()
cpumask_copy(tick_oneshot_broacast_mask,
tick_broadcast_mask);
broadcast device mode = oneshot
Timer interrupt
irq_enter()
tick_check_oneshot_broadcast()
dev->set_mode(ONESHOT);
tick_handle_periodic()
if (dev->mode == ONESHOT)
dev->next_event += period;
FAIL.
We fail, because dev->next_event contains KTIME_MAX, if the device was
in periodic mode before the uncontrolled switch to oneshot happened.
We must copy the broadcast bits over to the oneshot mask, because
otherwise a CPU which relies on the broadcast would not been woken up
anymore after the broadcast device switched to oneshot mode.
So we need to verify in tick_check_oneshot_broadcast() whether the CPU
has already switched to oneshot mode. If not, leave the device
untouched and let the CPU switch controlled into oneshot mode.
This is a long standing bug, which was never noticed, because the main
user of the broadcast x86 cannot run into that scenario, AFAICT. The
nonarchitected timer mess of ARM creates a gazillion of differently
broken abominations which trigger the shortcomings of that broadcast
code, which better had never been necessary in the first place.
Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>,
Cc: Mark Rutland <mark.rutland@arm.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In periodic mode we remove offline cpus from the broadcast propagation
mask. In oneshot mode we fail to do so. This was not a problem so far,
but the recent changes to the broadcast propagation introduced a
constellation which can result in a NULL pointer dereference.
What happens is:
CPU0 CPU1
idle()
arch_idle()
tick_broadcast_oneshot_control(OFF);
set cpu1 in tick_broadcast_force_mask
if (cpu_offline())
arch_cpu_dead()
cpu_dead_cleanup(cpu1)
cpu1 tickdevice pointer = NULL
broadcast interrupt
dereference cpu1 tickdevice pointer -> OOPS
We dereference the pointer because cpu1 is still set in
tick_broadcast_force_mask and tick_do_broadcast() expects a valid
cpumask and therefor lacks any further checks.
Remove the cpu from the tick_broadcast_force_mask before we set the
tick device pointer to NULL. Also add a sanity check to the oneshot
broadcast function, so we can detect such issues w/o crashing the
machine.
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: athorlton@sgi.com
Cc: CAI Qian <caiqian@redhat.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1306261303260.4013@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Although parameters are supposed to be part of the kernel API, experimental
parameters are often removed. In addition, downgrading a kernel might cause
previously-working modules to fail to load.
On balance, it's probably better to warn, and load the module anyway.
This may let through a typo, but at least the logs will show it.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There is no such path as /sys/parameters, module parameters live in
/sys/module/*/parameters.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we pass a pointer to a const string in the form "module:symbol"
module_kallsyms_lookup_name() will try to split the string at the colon,
i.e., will try to modify r/o data. That will, in fact, fail on a kernel
with enabled CONFIG_DEBUG_RODATA.
Avoid modifying the passed string in module_kallsyms_lookup_name(),
modify find_module_all() instead to pass it the module name length.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There are multiple places where the ftrace_trace_arrays list is accessed in
trace_events.c without the trace_types_lock held.
Link: http://lkml.kernel.org/r/1372732674-22726-1-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_marker file was present for each new instance created, but it
added the trace mark to the global trace buffer instead of to
the instance's buffer.
Link: http://lkml.kernel.org/r/1372717885-4543-2-git-send-email-azl@google.com
Cc: David Sharp <dhsharp@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the kernel command line ftrace filter parameters are set
(ftrace_filter or ftrace_notrace), force the function self test to
pass, with a warning why it was forced.
If the user adds a filter to the kernel command line, it is assumed
that they know what they are doing, and the self test should just not
run instead of failing (which disables function tracing) or clearing
the filter, as that will probably annoy the user.
If the user wants the selftest to run, the message will tell them why
it did not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kprobe_perf_func() and kretprobe_perf_func() pass addr=ip to
perf_trace_buf_submit() for no reason.
This sets perf_sample_data->addr for PERF_SAMPLE_ADDR, we already
have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
Link: http://lkml.kernel.org/r/20130620173811.GA13161@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ring buffer is disabled and the irqsoff tracer records a trace it
will clear out its buffer and lose the data it had previously recorded.
Currently there's a callback when writing to the tracing_of file, but if
tracing is disabled via the function tracer trigger, it will not inform
the irqsoff tracer to stop recording.
By using the "mirror" flag (buffer_disabled) in the trace_array, that keeps
track of the status of the trace_array's buffer, it gives the irqsoff
tracer a fast way to know if it should record a new trace or not.
The flag may be a little behind the real state of the buffer, but it
should not affect the trace too much. It's more important for the irqsoff
tracer to be fast.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I think that "ftrace_event_file *trace_probe[]" complicates the
code for no reason, turn it into list_head to simplify the code.
enable_trace_probe() no longer needs synchronize_sched().
This needs the extra sizeof(list_head) memory for every attached
ftrace_event_file, hopefully not a problem in this case.
Link: http://lkml.kernel.org/r/20130620173814.GA13165@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The comment on the soft disable 'disable' case of
__ftrace_event_enable_disable() states that the soft disable bit
should be cleared in that case, but currently only the soft mode bit
is actually cleared.
This essentially leaves the standard non-soft-enable enable/disable
paths as the only way to clear the soft disable flag, but the soft
disable bit should also be cleared when removing a trigger with '!'.
Also, the SOFT_DISABLED bit should never be set if SOFT_MODE is
cleared.
This fixes the above discrepancies.
Link: http://lkml.kernel.org/r/b9c68dd50bc07019e6c67d3f9b29be4ef1b2badb.1372479499.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
enable_trace_probe() and disable_trace_probe() should not worry about
serialization, the caller (perf_trace_init or __ftrace_set_clr_event)
holds event_mutex.
They are also called by kprobe_trace_self_tests_init(), but this __init
function can't race with itself or trace_events.c
And note that this code depended on event_mutex even before 41a7dd420c
which introduced probe_enable_lock. In fact it assumes that the caller
kprobe_register() can never race with itself. Otherwise, say, tp->flags
manipulations are racy.
Link: http://lkml.kernel.org/r/20130620173809.GA13158@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change kprobe_perf_func()
and kretprobe_perf_func() to check call->perf_events beforehand
and return if this list is empty.
For example, "perf record -e some_probe -p1". Only /sbin/init will
report, all other threads which hit the same probe will do
perf_trace_buf_prepare/perf_trace_buf_submit just to realize that
nobody wants perf_swevent_event().
Link: http://lkml.kernel.org/r/20130620173806.GA13151@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Running the following:
# cd /sys/kernel/debug/tracing
# echo p:i do_sys_open > kprobe_events
# echo p:j schedule >> kprobe_events
# cat kprobe_events
p:kprobes/i do_sys_open
p:kprobes/j schedule
# echo p:i do_sys_open >> kprobe_events
# cat kprobe_events
p:kprobes/j schedule
p:kprobes/i do_sys_open
# ls /sys/kernel/debug/tracing/events/kprobes/
enable filter j
Notice that the 'i' is missing from the kprobes directory.
The console produces:
"Failed to create system directory kprobes"
This is because kprobes passes in a allocated name for the system
and the ftrace event subsystem saves off that name instead of creating
a duplicate for it. But the kprobes may free the system name making
the pointer to it invalid.
This bug was introduced by 92edca073c "tracing: Use direct field, type
and system names" which switched from using kstrdup() on the system name
in favor of just keeping apointer to it, as the internal ftrace event
system names are static and exist for the life of the computer being booted.
Instead of reverting back to duplicating system names again, we can use
core_kernel_data() to determine if the passed in name was allocated or
static. Then use the MSB of the ref_count to be a flag to keep track if
the name was allocated or not. Then we can still save from having to duplicate
strings that will always exist, but still copy the ones that may be freed.
Cc: stable@vger.kernel.org # 3.10
Reported-by: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJR0K2gAAoJEHm+PkMAQRiGWsEH+gMZSN1qRm34hZ82q1Tx7HvL
Eb/Gsl3Qw/7G2TlTqgjBUs36IdqV9O2cui/aa3/TfXvdvrx+0GlhRkEwQPc+ygcO
Mvoyoke4tT4+4jVFdCg1J8avREsa28/6oaHs0ZZxuVmJBBLTJH7aXaNsGn6eU1q9
9+p798MQis6naIiPC63somlZcCIiBhsuWCPWpEfLMn8G1HWAFTM3xXIbNBqe/brS
bmIOfhomlIZ5dcdaXGvjtP3+KJhkNDwhkPC4tVYu8JqqgSlrE+a+EGyEuuGqKk10
U+swiqyuD31uBI9ga54u/2FzSqDiAu6YOcMXevjo/m3g9XLdYbYLvN+nvN8alCQ=
=Ob6Z
-----END PGP SIGNATURE-----
Merge tag 'v3.10' into sched/core
Merge in a recent upstream commit:
c2853c8df5 include/linux/math64.h: add div64_ul()
because:
72a4cf20cb sched: Change cfs_rq load avg to unsigned long
relies on it.
[ We don't rebase sched/core for this, because the handful of
followup commits after the broken commit are not behavioral
changes so are unlikely to be needed during bisection. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
0ce6cba357 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when
comparing mount options") only updated the remount path but
CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options
while mounting an existing hierarchy. As option mismatch triggers a
warning but doesn't fail the mount without sane_behavior, this only
triggers a spurious warning message.
Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new
and existing root options.
Signed-off-by: Tejun Heo <tj@kernel.org>
This __put_user() could be used by unprivileged processes to write into
kernel memory. The issue here is that even if copy_siginfo_to_user()
fails, the error code is not checked before __put_user() is executed.
Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle,
so it has not hit a stable release yet.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer fix from Thomas Gleixner:
"Correct an ordering issue in the tick broadcast code. I really wish
we'd get compensation for pain and suffering for each line of code we
write to work around dysfunctional timer hardware."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Fix tick_broadcast_pending_mask not cleared
If the clock was set (stepped), set the action parameter to functions
in the pvclock gtod notifier chain to non-zero. This allows the
callee to only do work if the clock was stepped.
This will be used on Xen as the synchronization of the Xen wallclock
to the control domain's (dom0) system time will be done with this
notifier and updating on every timer tick is unnecessary and too
expensive.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/1372329348-20841-4-git-send-email-david.vrabel@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Instead of passing multiple bools to timekeeping_updated(), define
flags and use a single 'action' parameter. It is then more obvious
what each timekeeping_update() call does.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/1372329348-20841-3-git-send-email-david.vrabel@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hrtimers_resume() only reprograms the timers for the current CPU as it
assumes that all other CPUs are offline at this point in the resume
process. If other CPUs are online then their timers will not be
corrected and they may fire at the wrong time.
When running as a Xen guest, this assumption is not true. Non-boot
CPUs are only stopped with IRQs disabled instead of offlining them.
This is a performance optimization as disabling the CPUs would add an
unacceptable amount of additional downtime during a live migration (>
200 ms for a 4 VCPU guest).
hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
as the other CPUs will be stopped with IRQs disabled. Instead, defer
the call to the next softirq.
[ tglx: Separated the xen change out ]
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/1372329348-20841-2-git-send-email-david.vrabel@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Direct compare of jiffies related values does not work in the wrap
around case. Replace it with time_is_after_jiffies().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/519BC066.5080600@acm.org
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* freezer:
af_unix: use freezable blocking calls in read
sigtimedwait: use freezable blocking call
nanosleep: use freezable blocking call
futex: use freezable blocking call
select: use freezable blocking call
epoll: use freezable blocking call
binder: use freezable blocking calls
freezer: add new freezable helpers using freezer_do_not_count()
freezer: convert freezable helpers to static inline where possible
freezer: convert freezable helpers to freezer_do_not_count()
freezer: skip waking up tasks with PF_FREEZER_SKIP set
freezer: shorten freezer sleep time using exponential backoff
lockdep: check that no locks held at freeze time
lockdep: remove task argument from debug_check_no_locks_held
freezer: add unsafe versions of freezable helpers for CIFS
freezer: add unsafe versions of freezable helpers for NFS
When building imx_v6_v7_defconfig with imx-drm drivers selected as
modules, we get the following build errors:
ERROR: "irq_gc_mask_clr_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_mask_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_ack_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
Export the required functions to avoid this problem.
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: shawn.guo@linaro.org
Cc: kernel@pengutronix.de
Link: http://lkml.kernel.org/r/1372389789-7048-1-git-send-email-festevam@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 02725e7471 ('genirq: Use irq_get/put functions'),
inadvertently changed can_request_irq() to return 0 for IRQs that have
no action. This causes pcibios_lookup_irq() to select only IRQs that
already have an action with IRQF_SHARED set, or to fail if there are
none. Change can_request_irq() to return 1 for IRQs that have no
action (if the first two conditions are met).
Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Tested-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> (against 3.2)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 709647@bugs.debian.org
Cc: stable@vger.kernel.org # 2.6.39+
Link: http://bugs.debian.org/709647
Link: http://lkml.kernel.org/r/1372383630.23847.40.camel@deadeye.wl.decadent.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch alters format string's width, to align all statistics
at par with the longest struct sched_statistic member name under
/proc/<PID>/sched.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20130627165005.GA15583@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1672d04070 ("cgroup: fix cgroupfs_root early destruction path")
introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
subsys binding on a new root; however, this broke remounts.
cgroup_remount() doesn't allow changing root options via remount and
CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
makes the function reject all remounts.
Fix it by putting the options part in the lower 16 bits of root->flags
and masking the comparions. While at it, make cgroup_remount() emit
an error message explaining why it's rejecting a remount request, so
that it's less of a mystery.
Signed-off-by: Tejun Heo <tj@kernel.org>
pm_trace uses the system's Real Time Clock (RTC) to save the magic
number. The reason for this is that the RTC is the only reliably
available piece of hardware during resume operations where a value
can be set that will survive a reboot.
Consequence is that after a resume (even if it is successful) your
system clock will have a value corresponding to the magic number
instead of the correct date/time! It is therefore advisable to use
a program like ntp-date or rdate to reset the correct date/time from
an external time source when using this trace option.
There is no run-time message to warn users of the consequences of
enabling pm_trace. Adding a warning message to pm_trace_store()
will serve as a reminder to users to set the system date and time
after resume.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
At present we print per-entity load-tracking statistics for
cfs_rq of cgroups/runqueues. Given that per task statistics
is maintained, it can be used to know the contribution made
by the task to its parenting cfs_rq level.
This patch adds per-task load-tracking statistics to /proc/<PID>/sched.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130625080336.GA20175@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since no one use it.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-13-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Similar to runnable_load_avg, blocked_load_avg variable, long type is
enough for removed_load in 64 bit or 32 bit machine.
Then we avoid the expensive atomic64 operations on 32 bit machine.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-12-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since tg->load_avg is smaller than tg->load_weight, we don't need a
atomic64_t variable for load_avg in 32 bit machine.
The same reason for cfs_rq->tg_load_contrib.
The atomic_long_t/unsigned long variable type are more efficient and
convenience for them.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-11-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the 'u64 runnable_load_avg, blocked_load_avg' in cfs_rq struct are
smaller than 'unsigned long' cfs_rq->load.weight. We don't need u64
vaiables to describe them. unsigned long is more efficient and convenience.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-10-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Aside from using runnable load average in background, move_tasks is
also the key function in load balance. We need consider the runnable
load average in it in order to make it an apple to apple load
comparison.
Morten had caught a div u64 bug on ARM, thanks!
Thanks-to: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-8-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They are the base values in load balance, update them with rq runnable
load average, then the load balance will consider runnable load avg
naturally.
We also try to include the blocked_load_avg as cpu load in balancing,
but that cause kbuild performance drop 6% on every Intel machine, and
aim7/oltp drop on some of 4 CPU sockets machines.
Or only add blocked_load_avg into get_rq_runable_load, hackbench still
drop a little on NHM EX.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-7-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To get the latest runnable info, we need do this cpuload update after
task_tick.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-6-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The woken migrated task will __synchronize_entity_decay(se); in
migrate_task_rq_fair, then it needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before
update_entity_load_avg, in order to avoid sleep time is updated twice
for se.avg.load_avg_contrib in both __syncchronize and
update_entity_load_avg.
However if the sleeping task is woken up from the same cpu, it miss
the last_runnable_update before update_entity_load_avg(se, 0, 1), then
the sleep time was used twice in both functions. So we need to remove
the double sleep time accounting.
Paul also contributed some code comments in this commit.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-5-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We need to initialize the se.avg.{decay_count, load_avg_contrib} for a
new forked task. Otherwise random values of above variables cause a
mess when a new task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg
and make fork balancing imbalance due to incorrect load_avg_contrib.
Further more, Morten Rasmussen notice some tasks were not launched at
once after created. So Paul and Peter suggest giving a start value for
new task runnable avg time same as sched_slice().
PeterZ said:
> So the 'problem' is that our running avg is a 'floating' average; ie. it
> decays with time. Now we have to guess about the future of our newly
> spawned task -- something that is nigh impossible seeing these CPU
> vendors keep refusing to implement the crystal ball instruction.
>
> So there's two asymptotic cases we want to deal well with; 1) the case
> where the newly spawned program will be 'nearly' idle for its lifetime;
> and 2) the case where its cpu-bound.
>
> Since we have to guess, we'll go for worst case and assume its
> cpu-bound; now we don't want to make the avg so heavy adjusting to the
> near-idle case takes forever. We want to be able to quickly adjust and
> lower our running avg.
>
> Now we also don't want to make our avg too light, such that it gets
> decremented just for the new task not having had a chance to run yet --
> even if when it would run, it would be more cpu-bound than not.
>
> So what we do is we make the initial avg of the same duration as that we
> guess it takes to run each task on the system at least once -- aka
> sched_slice().
>
> Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> case it should be good enough.
Paul also contributed most of the code comments in this commit.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Paul Turner <pjt@google.com>
[peterz; added explanation of sched_slice() usage]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following 2 variables are only used under CONFIG_SMP, so its
better to move their definiation into CONFIG_SMP too.
atomic64_t load_avg;
atomic_t runnable_avg;
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-3-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.
Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted
patch(introduced in 9ee474f), but also need to revert.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Ingo Molnar:
"Three small fixlets"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
kprobes: Fix arch_prepare_kprobe to handle copy insn failures
kernel/cgroup.c still has places where a RCU pointer is set and
accessed directly without going through RCU_INIT_POINTER() or
rcu_dereference_protected(). They're all properly protected accesses
so nothing is broken but it leads to spurious sparse RCU address space
warnings.
Substitute direct accesses with RCU_INIT_POINTER() and
rcu_dereference_protected(). Note that %true is specified as the
extra condition for all derference updates. This isn't ideal as all
it does is suppressing warning without actually policing
synchronization rules; however, most are scheduled to be removed
pretty soon along with css_id itself, so no reason to be more
elaborate.
Combined with the previous changes, this removes all RCU related
sparse warnings from cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by; Li Zefan <lizefan@huawei.com>
There are several places in kernel/cgroup.c where task->cgroups is
accessed and modified without going through proper RCU accessors.
None is broken as they're all lock protected accesses; however, this
still triggers sparse RCU address space warnings.
* Consistently use task_css_set() for task->cgroups dereferencing.
* Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
exit.
* Remove unnecessary rcu_dereference_raw() from cset->subsys[]
dereference in cgroup_exit().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>
This isn't strictly necessary as all subsystems specified in
@subsys_mask are guaranteed to be pinned; however, it does spuriously
trigger lockdep warning. Let's grab cgroup_mutex around it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroupfs_root used to have ->actual_subsys_mask in addition to
->subsys_mask. a8a648c4ac ("cgroup: remove
cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
essentially temporary and doesn't belong in cgroupfs_root; however,
the patch made it impossible to tell whether a cgroupfs_root actually
has the subsystems bound or just have the bits set leading to the
following BUG when trying to mount with subsystems which are already
mounted elsewhere.
kernel BUG at kernel/cgroup.c:1038!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
...
CPU: 1 PID: 7973 Comm: mount Tainted: G W 3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
RIP: 0010:[<ffffffff81249b29>] [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
...
Call Trace:
[<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
[<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
[<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
[<ffffffff81257169>] cpuset_mount+0xd9/0x110
[<ffffffff813d2580>] mount_fs+0xb0/0x2d0
[<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
[<ffffffff814070b5>] do_new_mount+0x145/0x2c0
[<ffffffff814085d6>] do_mount+0x356/0x3c0
[<ffffffff8140873d>] SyS_mount+0xfd/0x140
[<ffffffff854eb600>] tracesys+0xdd/0xe2
We still want rebind_subsystems() to take added/removed masks, so
let's fix it by marking whether a cgroupfs_root has finished binding
or not. Also, document what's going on around ->subsys_mask
initialization so that similar mistakes aren't repeated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Injects EDEADLK conditions at pseudo-random interval, with
exponential backoff up to UINT_MAX (to ensure that every lock
operation still completes in a reasonable time).
This way we can test the wound slowpath even for ww mutex users
where contention is never expected, and the ww deadlock
avoidance algorithm is only needed for correctness against
malicious userspace. An example would be protecting kernel
modesetting properties, which thanks to single-threaded X isn't
really expected to contend, ever.
I've looked into using the CONFIG_FAULT_INJECTION
infrastructure, but decided against it for two reasons:
- EDEADLK handling is mandatory for ww mutex users and should
never affect the outcome of a syscall. This is in contrast to -ENOMEM
injection. So fine configurability isn't required.
- The fault injection framework only allows to set a simple
probability for failure. Now the probability that a ww mutex acquire
stage with N locks will never complete (due to too many injected
EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock
operations for the completely uncontended case would be O(exp(N)).
The per-acuiqire ctx exponential backoff solution choosen here only
results in O(log N) overhead due to injection and so O(log N * N)
lock operations. This way we can fail with high probability (and so
have good test coverage even for fancy backoff and lock acquisition
paths) without running into patalogical cases.
Note that EDEADLK will only ever be injected when we managed to
acquire the lock. This prevents any behaviour changes for users
which rely on the EALREADY semantics.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Wound/wait mutexes are used when other multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older tasks waits until it can acquire
the contended lock. The younger tasks needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.
For full documentation please read Documentation/ww-mutex-design.txt.
References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Rob Clark <robdclark@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This will allow me to call functions that have multiple
arguments if fastpath fails. This is required to support ticket
mutexes, because they need to be able to pass an extra argument
to the fail function.
Originally I duplicated the functions, by adding
__mutex_fastpath_lock_retval_arg. This ended up being just a
duplication of the existing function, so a way to test if
fastpath was called ended up being better.
This also cleaned up the reservation mutex patch some by being
able to call an atomic_set instead of atomic_xchg, and making it
easier to detect if the wrong unlock function was previously
used.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: robclark@gmail.com
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ip_vs.h is not necessary for sysctl_binary.c.
prepare for the next patch to avoid compile issue.
Signed-off-by: JunweiZhang <junwei.zhang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Signed-off-by: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: arve@android.com
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The futex_keys of process shared futexes are generated from the page
offset, the mapping host and the mapping index of the futex user space
address. This should result in an unique identifier for each futex.
Though this is not true when futexes are located in different subpages
of an hugepage. The reason is, that the mapping index for all those
futexes evaluates to the index of the base page of the hugetlbfs
mapping. So a futex at offset 0 of the hugepage mapping and another
one at offset PAGE_SIZE of the same hugepage mapping have identical
futex_keys. This happens because the futex code blindly uses
page->index.
Steps to reproduce the bug:
1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
mapping.
The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
their keys solely depend on the user space address.
2. Lock mutex1 and mutex2
3. Create thread1 and in the thread function lock mutex1, which
results in thread1 blocking on the locked mutex1.
4. Create thread2 and in the thread function lock mutex2, which
results in thread2 blocking on the locked mutex2.
5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
still blocks on mutex2 because the futex_key points to mutex1.
To solve this issue we need to take the normal page index of the page
which contains the futex into account, if the futex is in an hugetlbfs
mapping. In other words, we calculate the normal page mapping index of
the subpage in the hugetlbfs mapping.
Mappings which are not based on hugetlbfs are not affected and still
use page->index.
Thanks to Mel Gorman who provided a patch for adding proper evaluation
functions to the hugetlbfs code to avoid exposing hugetlbfs specific
details to the futex code.
[ tglx: Massaged changelog ]
Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
Cc: 'Peter Zijlstra' <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"),
hierarchy IDs were allocated from 0. As the dummy hierarchy was
always the one first initialized, it got assigned 0 and all other
hierarchies from 1. The patch accidentally changed the minimum
useable ID to 2.
Let's restore ID 0 for dummy_root and while at it reserve 1 for
unified hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
There are quite a few places where all loaded [builtin] subsys are
iterated. Implement for_each_[builtin_]subsys() and replace manual
iterations with those to simplify those places a bit. The new
iterators automatically skip NULL subsystems. This shouldn't cause
any functional difference.
Iteration loops which scan all subsystems and then skipping modular
ones explicitly are converted to use for_each_builtin_subsys().
While at it, reorder variable declarations and adjust whitespaces a
bit in the affected functions.
v2: Add lockdep_assert_held() in for_each_subsys() and add comments
about synchronization as suggested by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_init() was doing init_css_set initialization outside
cgroup_mutex, which is fine but we want to add lockdep annotation on
subsystem iterations and cgroup_init() will trigger it spuriously.
Move init_css_set initialization inside cgroup_mutex.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Use irq_get_trigger_type() to get the IRQ trigger type flags
instead calling irqd_get_trigger_type(irq_desc_get_irq_data(virq))
Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/1371228049-27080-8-git-send-email-javier.martinez@collabora.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
for_each_subsys() walks over subsystems attached to a hierarchy and
we're gonna add iterators which walk over all available subsystems.
Rename for_each_subsys() to for_each_root_subsys() so that it's more
appropriately named and for_each_subsys() can be used to iterate all
subsystems.
While at it, remove unnecessary underbar prefix from macro arguments,
put them inside parentheses, and adjust indentation for the two
for_each_*() macros.
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
find_css_set() passes uninitialized on-stack template[] array to
find_existing_css_set() which sets the entries for all subsystems.
Passing around an uninitialized array is a bit icky and we want to
introduce an iterator which only iterates loaded subsystems. Let's
initialize it on definition.
While at it, also make the following cosmetic cleanups.
* Convert to proper /** comments.
* Reorder variable declarations.
* Replace comment on synchronization with lockdep_assert_held().
This patch doesn't make any functional differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup curiously has two subsystem masks, ->subsys_mask and
->actual_subsys_mask. The latter only exists because the new target
subsys_mask is passed into rebind_subsystems() via @root>subsys_mask.
rebind_subsystems() needs to know what the current mask is to decide
how to reach the target mask so ->actual_subsys_mask is used as the
temp location to remember the current state.
Adding a temporary field to a permanent data structure is rather silly
and can be misleading. Update rebind_subsystems() to take @added_mask
and @removed_mask instead and remove @root->actual_subsys_mask.
This patch shouldn't introduce any behavior changes.
v2: Comment and description updated as suggested by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Global variable names in kernel/cgroup.c are asking for trouble -
subsys, roots, rootnode and so on. Rename them to have "cgroup_"
prefix.
* s/subsys/cgroup_subsys/
* s/rootnode/cgroup_dummy_root/
* s/dummytop/cgroup_cummy_top/
* s/roots/cgroup_roots/
* s/root_count/cgroup_root_count/
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
On an SMP system with only one global clockevent and a dummy
clockevent per CPU we run into problems. We want the dummy
clockevents to be registered as the per CPU tick devices, but
we can only achieve that if we register the dummy clockevents
before the global clockevent or if we artificially inflate the
rating of the dummy clockevents to be higher than the rating
of the global clockevent. Failure to do so leads to boot
hangs when the dummy timers are registered on all other CPUs
besides the CPU that accepted the global clockevent as its tick
device and there is no broadcast timer to poke the dummy
devices.
If we're registering multiple clockevents and one clockevent is
global and the other is local to a particular CPU we should
choose to use the local clockevent regardless of the rating of
the device. This way, if the clockevent is a dummy it will take
the tick device duty as long as there isn't a higher rated tick
device and any global clockevent will be bumped out into
broadcast mode, fixing the problem described above.
Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Tested-by: soren.brinkmann@xilinx.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20130613183950.GA32061@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 088f40b7b0 ("genirq: Generic chip:
Add linear irq domain support") missed kerneldoc for the gcflags
argument of irq_alloc_domain_generic_chips(). Add it now.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Link: http://lkml.kernel.org/r/1371564513-4327-1-git-send-email-james.hogan@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ERROR: space required before the open parenthesis '('
WARNING: Prefer pr_warn(... to pr_warning(...
Just fix above 2 issue.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Add the hardware interrupt number to the output of /proc/interrupts.
It is often important to have access to the hardware interrupt number because
it identifies exactly how an interrupt signal is wired up to the interrupt
controller. This is especially important when using irq_domains since irq
numbers get dynamically allocated in that case, and have no relation to the
actual hardware number.
Note: This output is currently conditional on whether or not the irq_domain
pointer is set; however hwirq could still be used without irq_domain. It
may be worthwhile to always output the hwirq number regardless of the
domain pointer.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Olof Johansson <olof@lixom.net>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Over the years, irq_linear_revmap() gained tests and checks to make sure
callers were using it safely, which while important, also make it less
of a fast path. After the irqdomain refactoring done recently, it is now
possible to make irq_linear_revmap() a fast path again. This patch moves
irq_linear_revmap() to the header file and makes it a static inline so
that interrupt controller drivers using a linear mapping can decode the
virq from a hwirq in just a couple of instructions.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Originally, irq_domain_associate_many() was designed to unwind the
mapped irqs on a failure of any individual association. However, that
proved to be a problem with certain IRQ controllers. Some of them only
support a subset of irqs, and will fail when attempting to map a
reserved IRQ. In those cases we want to map as many IRQs as possible, so
instead it is better for irq_domain_associate_many() to make a
best-effort attempt to map irqs, but not fail if any or all of them
don't succeed. If a caller really cares about how many irqs got
associated, then it should instead go back and check that all of the
irqs is cares about were mapped.
The original design open-coded the individual association code into the
body of irq_domain_associate_many(), but with no longer needing to
unwind associations, the code becomes simpler to split out
irq_domain_associate() to contain the bulk of the logic, and
irq_domain_associate_many() to be a simple loop wrapper.
This patch also adds a new error check to the associate path to make
sure it isn't called for an irq larger than the controller can handle,
and adds locking so that the irq_domain_mutex is held while setting up a
new association.
v3: Fixup missing change to irq_domain_add_tree()
v2: Fixup x86 warning. irq_domain_associate_many() no longer returns an
error code, but reports errors to the printk log directly. In the
majority of cases we don't actually want to fail if there is a
problem, but rather log it and still try to boot the system.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
irqdomain: Fix flubbed irq_domain_associate_many refactoring
commit d39046ec72, "irqdomain: Refactor irq_domain_associate_many()" was
missing the following hunk which causes a boot failure on anything using
irq_domain_add_tree() to allocate an irq domain.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Adds tracepoints to pm_qos_add_request, pm_qos_update_request,
pm_qos_remove_request, and pm_qos_update_request_timeout.
It's useful for checking pm_qos_class, value, and timeout_us.
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds tracepoints to pm_qos_update_target and
pm_qos_update_flags. It's useful for checking pm qos action,
previous value and current value.
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch keeps track of how long perf's NMI handler is taking,
and also calculates how many samples perf can take a second. If
the sample length times the expected max number of samples
exceeds a configurable threshold, it drops the sample rate.
This way, we don't have a runaway sampling process eating up the
CPU.
This patch can tend to drop the sample rate down to level where
perf doesn't work very well. *BUT* the alternative is that my
system hangs because it spends all of its time handling NMIs.
I'll take a busted performance tool over an entire system that's
busted and undebuggable any day.
BTW, my suspicion is that there's still an underlying bug here.
Using the HPET instead of the TSC is definitely a contributing
factor, but I suspect there are some other things going on.
But, I can't go dig down on a bug like that with my machine
hanging all the time.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
[ Prettified it a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Peter Anvin:
"This series fixes a couple of build failures, and fixes MTRR cleanup
and memory setup on very specific memory maps.
Finally, it fixes triggering backtraces on all CPUs, which was
inadvertently disabled on x86."
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Fix dummy variable buffer allocation
x86: Fix trigger_all_cpu_backtrace() implementation
x86: Fix section mismatch on load_ucode_ap
x86: fix build error and kconfig for ia32_emulation and binfmt
range: Do not add new blank slot with add_range_with_merge
x86, mtrr: Fix original mtrr range get for mtrr_cleanup
The recent modification in the cpuidle framework consolidated the
timer broadcast code across the different drivers by setting a new
flag in the idle state. It tells the cpuidle core code to enter/exit
the broadcast mode for the cpu when entering a deep idle state. The
broadcast timer enter/exit is no longer handled by the back-end
driver.
This change made the local interrupt to be enabled *before* calling
CLOCK_EVENT_NOTIFY_EXIT.
On a tegra114, a four cores system, when the flag has been introduced
in the driver, the following warning appeared:
WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15
[<c00667f8>] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [<c0065cd0>] (tick_notify+0x240/0x40c)
[<c0065cd0>] (tick_notify+0x240/0x40c) from [<c0044724>] (notifier_call_chain+0x44/0x84)
[<c0044724>] (notifier_call_chain+0x44/0x84) from [<c0044828>] (raw_notifier_call_chain+0x18/0x20)
[<c0044828>] (raw_notifier_call_chain+0x18/0x20) from [<c00650cc>] (clockevents_notify+0x28/0x170)
[<c00650cc>] (clockevents_notify+0x28/0x170) from [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168)
[<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) from [<c000ea94>] (arch_cpu_idle+0x8/0x38)
[<c000ea94>] (arch_cpu_idle+0x8/0x38) from [<c005ea80>] (cpu_startup_entry+0x60/0x134)
[<c005ea80>] (cpu_startup_entry+0x60/0x134) from [<804fe9a4>] (0x804fe9a4)
I don't have the hardware, so I wasn't able to reproduce the warning
but after looking a while at the code, I deduced the following:
1. the CPU2 enters a deep idle state and sets the broadcast timer
2. the timer expires, the tick_handle_oneshot_broadcast function is
called, setting the tick_broadcast_pending_mask and waking up the
idle cpu CPU2
3. the CPU2 exits idle handles the interrupt and then invokes
tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT which
runs the following code:
[...]
if (dev->next_event.tv64 == KTIME_MAX)
goto out;
if (cpumask_test_and_clear_cpu(cpu,
tick_broadcast_pending_mask))
goto out;
[...]
So if there is no next event scheduled for CPU2, we fulfil the
first condition and jump out without clearing the
tick_broadcast_pending_mask.
4. CPU2 goes to deep idle again and calls
tick_broadcast_oneshot_control with CLOCK_NOTIFY_EVENT_ENTER but
with the tick_broadcast_pending_mask set for CPU2, triggering the
warning.
The issue only surfaced due to the modifications of the cpuidle
framework, which resulted in interrupts being enabled before the call
to the clockevents code. If the call happens before interrupts have
been enabled, the warning cannot trigger, because there is still the
event pending which caused the broadcast timer expiry.
Move the check for the next event below the check for the pending bit,
so the pending bit gets cleared whether an event is scheduled on the
cpu or not.
[ tglx: Massaged changelog ]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reported-and-tested-by: Joseph Lo <josephl@nvidia.com>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit a938da06 introduced a useful little log message to tell
users/debuggers which wakeup source aborted a suspend. However,
this message is only printed if the abort happens during the
in-kernel suspend path (after writing /sys/power/state).
The full specification of the /sys/power/wakeup_count facility
allows user-space power managers to double-check if wakeups have
already happened before it actually tries to suspend (e.g. while it
was running user-space pre-suspend hooks), by writing the last known
wakeup_count value to /sys/power/wakeup_count. This patch changes
the sysfs handler for that node to also print said log message if
that write fails, so that we can figure out the offending wakeup
source for both kinds of suspend aborts.
Signed-off-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The valid start index for pm_qos_array is not 0, but
PM_QOS_CPU_DMA_LATENCY. There is a null_pm_qos at index 0 of
pm_qos_array. However, null_pm_qos is not created as misc device so
that inclusion of 0 index for checking pm_qos_class especially for
file operations is not proper here.
[rjw: Changelog, a bit]
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull scheduler fixes from Ingo Molnar:
"Two smaller fixes - plus a context tracking tracing fix that is a bit
bigger"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/context-tracking: Add preempt_schedule_context() for tracing
sched: Fix clear NOHZ_BALANCE_KICK
sched/x86: Construct all sibling maps if smt
Pull perf fixes from Ingo Molnar:
"Four fixes. The mmap ones are unfortunately larger than desired -
fuzzing uncovered bugs that needed perf context life time management
changes to fix properly"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix broken PEBS-LL support on SNB-EP/IVB-EP
perf: Fix mmap() accounting hole
perf: Fix perf mmap bugs
kprobes: Fix to free gone and unused optprobes
Pull cpu idle fixes from Thomas Gleixner:
- Add a missing irq enable. Fallout of the idle conversion
- Fix stackprotector wreckage caused by the idle conversion
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
idle: Enable interrupts in the weak arch_cpu_idle() implementation
idle: Add the stack canary init to cpu_startup_entry()
Pull timer fixes from Thomas Gleixner:
- Fix inconstinant clock usage in virtual time accounting
- Fix a build error in KVM caused by the NOHZ work
- Remove a pointless timekeeping duty assignment which breaks NOHZ
- Use a proper notifier return value to avoid random behaviour
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Remove useless timekeeping duty attribution to broadcast source
nohz: Fix notifier return val that enforce timekeeping
kvm: Move guest entry/exit APIs to context_tracking
vtime: Use consistent clocks among nohz accounting
This patch simply moves all per-cpu variables into the new
single per-cpu "struct bp_cpuinfo".
To me this looks more logical and clean, but this can also
simplify the further potential changes. In particular, I do not
think this memory should be per-cpu, it is never used "locally".
After this change it is trivial to turn it into, say,
bootmem[nr_cpu_ids].
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. register_wide_hw_breakpoint() can use unregister_ if failure,
no need to duplicate the code.
2. "struct perf_event **pevent" adds the unnecesary lever of
indirection and complication, use per_cpu(*cpu_events, cpu).
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the trivial helper which simply returns cpumask_of() or
cpu_possible_mask depending on bp->cpu.
Change fetch_bp_busy_slots() and toggle_bp_slot() to always do
for_each_cpu(cpumask_of_bp) to simplify the code and avoid the
code duplication.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change toggle_bp_slot() to make "weight" negative if !enable.
This way we can always use "+ weight" without additional "if
(enable)" check and toggle_bp_task_slot() no longer needs this
arg.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155013.GA6337@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The enable/disable logic in toggle_bp_slot() is not symmetrical
and imho very confusing. "old_count" in toggle_bp_task_slot() is
actually new_count because this bp was already removed from the
list.
Change toggle_bp_slot() to always call list_add/list_del after
toggle_bp_task_slot(). This way old_idx is task_bp_pinned() and
this entry should be decremented, new_idx is +/-weight and we
need to increment this element. The code/logic looks obvious.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155011.GA6330@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
slot") in arch_install_hw_breakpoint() but the problem is not
arch-specific.
The problem is, task_bp_pinned(cpu) checks "cpu == iter->cpu"
but this doesn't account the "all cpus" events with iter->cpu <
0.
This means that, say, register_user_hw_breakpoint(tsk) can
happily create the arbitrary number > HBP_NUM of breakpoints
which can not be activated. toggle_bp_task_slot() is equally
wrong by the same reason and nr_task_bp_pinned[] can have
negative entries.
Simple test:
# perl -e 'sleep 1 while 1' &
# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`
Before this patch this triggers the same problem/WARN_ON(),
after the patch it correctly fails with -ENOSPC.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Building full dynticks now implies that all CPUs are forced
into RCU nocb mode through CONFIG_RCU_NOCB_CPU_ALL.
The dynamic check has become useless.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
When the watchdog runs, it prevents the full dynticks
CPUs from stopping their tick because the hard lockup
detector uses perf events internally, which in turn
rely on the periodic tick.
Since this is a rather confusing behaviour that is not
easy to track down and identify for those who want to
test CONFIG_NO_HZ_FULL, let's default disable the
watchdog on boot time when full dynticks is enabled.
The user can still enable it later on runtime using
proc or sysctl.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anish Singh <anish198519851985@gmail.com>
We have two very conflicting state variable names in the
watchdog:
* watchdog_enabled: This one reflects the user interface. It's
set to 1 by default and can be overriden with boot options
or sysctl/procfs interface.
* watchdog_disabled: This is the internal toggle state that
tells if watchdog threads, timers and NMI events are currently
running or not. This state mostly depends on the user settings.
It's a convenient state latch.
Now we really need to find clearer names because those
are just too confusing to encourage deep review.
watchdog_enabled now becomes watchdog_user_enabled to reflect
its purpose as an interface.
watchdog_disabled becomes watchdog_running to suggest its
role as a pure internal state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anish Singh <anish198519851985@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
Since tp->flags assignment was moved into function enable_trace_probe(),
there is no need to use trace_probe_is_enabled to check flags
in the same function.
Remove the unnecessary checking.
Link: http://lkml.kernel.org/r/51BA7B9E.3040807@huawei.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a traceoff_on_warning option in both the kernel command line as well
as a sysctl option. When set, any WARN*() function that is hit will cause
the tracing_on variable to be cleared, which disables writing to the
ring buffer.
This is useful especially when tracing a bug with function tracing. When
a warning is hit, the print caused by the warning can flood the trace with
the functions that producing the output for the warning. This can make the
resulting trace useless by either hiding where the bug happened, or worse,
by overflowing the buffer and losing the trace of the bug totally.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are some cases when filtering on a set flag of a field of a tracepoint
is useful. But currently the only filtering commands for numbered fields
is ==, !=, <, <=, >, >=. This does not help when you just want to trace if
a specific flag is set. For example:
> # sudo trace-cmd record -e brcmfmac:brcmf_dbg -f 'level & 0x40000'
> disable all
> enable brcmfmac:brcmf_dbg
> path = /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/enable
> (level & 0x40000)
> ^
> parse_error: Invalid operator
>
When trying to trace brcmf_dbg when level has its 1 << 18 bit set, the
filter fails to perform.
By allowing a binary '&' operation, this gives the user the ability to
test a bit.
Note, a binary '|' is not added, as it doesn't make sense as fields must
be compared to constants (for now), and ORing a constant will always return
true.
Link: http://lkml.kernel.org/r/1371057385.9844.261.camel@gandalf.local.home
Suggested-by: Arend van Spriel <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The user activation/deactivation of the watchdog through boot parameters
or systcl is currently implemented with a dance involving kthreads parking
and unparking methods: the threads are unconditionally registered on
boot and they park as soon as the user want the watchdog to be disabled.
This method involves a few noisy details to handle though: the watchdog
kthreads may be unparked anytime due to hotplug operations, after which
the watchdog internals have to decide to park again if it is user-disabled.
As a result the setup() and unpark() methods need to be able to request a
reparking. This is not currently supported in the kthread infrastructure
so this piece of the watchdog code only works halfway.
Besides, unparking/reparking the watchdog kthreads consume unnecessary
cputime on hotplug operations when those could be simply ignored in the
first place.
As suggested by Srivatsa, let's instead only register the watchdog
threads when they are needed. This way we don't need to think about
hotplug operations and we don't burden the CPU onlining when the watchdog
is simply disabled.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anish Singh <anish198519851985@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
If the user configures NO_HZ_FULL and defines nohz_full=XXX on the
kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails
due to the machine having a unstable clock, warn about it.
We do not want users thinking that they are getting the benefit
of nohz when their machine can not support it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Pull RCU changes from Paul E. McKenney:
"The major changes for this series are:
1. Simplify RCU's grace-period and callback processing based on
the new numbering for callbacks. These were posted to LKML at
https://lkml.org/lkml/2013/5/20/330.
2. Documentation updates. These were posted to LKML at
https://lkml.org/lkml/2013/5/20/348.
3. Miscellaneous fixes, including converting a few remaining printk()
calls to pr_*(). These were posted to LKML at
https://lkml.org/lkml/2013/5/20/324.
4. SRCU-related changes and fixes. These were posted to LKML at
https://lkml.org/lkml/2013/5/20/425.
5. Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for
single-CPU low-latency systems. These were posted to LKML at
https://lkml.org/lkml/2013/5/20/427."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Just use struct ctl_table.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371063336.2069.22.camel@joe-AO722
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sd can't be NULL in init_sched_groups_power() and so checking it for NULL isn't
useful. In case it is required, then also we need to rearrange the code a bit as
we already accessed invalid pointer sd to get sg: sg = sd->groups.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/2bbe633cd74b431c05253a8ce61fdfd5066a531b.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In build_sched_groups() we don't need to call get_group() for cpus
which are already covered in previous iterations. Calling get_group()
would mark the group used and eventually leak it since we wouldn't
connect it and not find it again to free it.
This will happen only in cases where sg->cpumask contained more than
one cpu (For any topology level). This patch would free sg's memory
for all cpus leaving the group leader as the group isn't marked used
now.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/7a61e955abdcbb1dfa9fe493f11a5ec53a11ddd3.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the beginning of build_sched_groups() we called sched_domain_span() and
cached its return value in span. Few statements later we are calling it again to
get the same pointer.
Lets use the cached value instead as it hasn't changed in between.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/834ecd507071ad88aff039352dbc7e063dd996a7.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For loop for traversing sched_domain_topology was used at multiple placed in
core.c. This patch removes code redundancy by creating for_each_sd_topology().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/e0e04542f54e9464bd9da54f5ccfe62ec6c4c0bc.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Memory for sd is allocated with kzalloc_node() which will initialize its fields
with zero. In build_sched_domain() we are setting sd->child to child even if
child is NULL, which isn't required.
Lets do it only if child isn't NULL.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/f4753a1730051341003ad2ad29a3229c7356678e.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are saving first scheduling domain for a cpu in build_sched_domains() by
iterating over the nested sd->child list. We don't actually need to do it this
way.
tl will be equal to sched_domain_topology for the first iteration and so we can
set *per_cpu_ptr(d.sd, i) based on that. So, save pointer to first SD while
running the iteration loop over tl's.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/fc473527cbc4dfa0b8eeef2a59db74684eb59a83.1370436120.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most of the stuff from kernel/sched.c was moved to kernel/sched/core.c long time
back and the comments/Documentation never got updated.
I figured it out when I was going through sched-domains.txt and so thought of
fixing it globally.
I haven't crossed check if the stuff that is referenced in sched/core.c by all
these files is still present and hasn't changed as that wasn't the motive behind
this patch.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
default_cfs_period(), do_sched_cfs_period_timer(), do_sched_cfs_slack_timer()
already defined previously, no need to declare again.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51AD8808.7020608@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Directly use rq to save some code.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51AD87EB.1070605@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ Peter, this is based off of some of my work, I ran it though a few
tests and it passed. I also reviewed it, and added my SOB as I am
somewhat a co-author to it. ]
Based on the patch by Steven Rostedt from previous year:
https://lkml.org/lkml/2012/4/18/517
1)Simplify pull_rt_task() logic: search in pushable tasks of dest runqueue.
The only pullable tasks are the tasks which are pushable in their local rq,
and no others.
2)Remove .leaf_rt_rq_list member of struct rt_rq and functions connected
with it: nobody uses it since now.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/287571370557898@web7d.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave Jones hit the following bug report:
===============================
[ INFO: suspicious RCU usage. ]
3.10.0-rc2+ #1 Not tainted
-------------------------------
include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
other info that might help us debug this:
RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
2 locks held by cc1/63645:
#0: (&rq->lock){-.-.-.}, at: [<ffffffff816b39fd>] __schedule+0xed/0x9b0
#1: (rcu_read_lock){.+.+..}, at: [<ffffffff8109d645>] cpuacct_charge+0x5/0x1f0
CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
Call Trace:
[<ffffffff816ae383>] dump_stack+0x19/0x1b
[<ffffffff810b698d>] lockdep_rcu_suspicious+0xfd/0x130
[<ffffffff8109d7c5>] cpuacct_charge+0x185/0x1f0
[<ffffffff8109d645>] ? cpuacct_charge+0x5/0x1f0
[<ffffffff8108dffc>] update_curr+0xec/0x240
[<ffffffff8108f528>] put_prev_task_fair+0x228/0x480
[<ffffffff816b3a71>] __schedule+0x161/0x9b0
[<ffffffff816b4721>] preempt_schedule+0x51/0x80
[<ffffffff816b4800>] ? __cond_resched_softirq+0x60/0x60
[<ffffffff816b6824>] ? retint_careful+0x12/0x2e
[<ffffffff810ff3cc>] ftrace_ops_control_func+0x1dc/0x210
[<ffffffff816be280>] ftrace_call+0x5/0x2f
[<ffffffff816b681d>] ? retint_careful+0xb/0x2e
[<ffffffff816b4805>] ? schedule_user+0x5/0x70
[<ffffffff816b4805>] ? schedule_user+0x5/0x70
[<ffffffff816b6824>] ? retint_careful+0x12/0x2e
------------[ cut here ]------------
What happened was that the function tracer traced the schedule_user() code
that tells RCU that the system is coming back from userspace, and to
add the CPU back to the RCU monitoring.
Because the function tracer does a preempt_disable/enable_notrace() calls
the preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set,
then preempt_schedule() is called. But this is called before the user_exit()
function can inform the kernel that the CPU is no longer in user mode and
needs to be accounted for by RCU.
The fix is to create a new preempt_schedule_context() that checks if
the kernel is still in user mode and if so to switch it to kernel mode
before calling schedule. It also switches back to user mode coming back
from schedule in need be.
The only user of this currently is the preempt_enable_notrace(), which is
only used by the tracing subsystem.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369423420.6828.226.camel@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform, in the following scenario:
CPU 0 and CPU 1 are running tasks and CPU 2 is idle
CPU 1 kicks the Idle Load Balance
CPU 1 selects CPU 2 as the new Idle Load Balancer
CPU 2 sets NOHZ_BALANCE_KICK for CPU 2
CPU 2 sends a reschedule IPI to CPU 2
While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2
CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
task A quickly goes back to sleep (before a tick occurs on CPU 2)
CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.
We must wait for the sched softirq to be raised on CPU 2 thanks to another
part the kernel to come back to clear NOHZ_BALANCE_KICK.
The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
we can't raise the sched_softirq for the Idle Load Balance.
Change since V1:
- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
can't run on this CPU (as suggested by Peter)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This allows us to use pdev->name for registering a PMU device.
IMO the name is not supposed to be changed anyway.
Signed-off-by: Mischa Jonker <mjonker@synopsys.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370339148-5566-1-git-send-email-mjonker@synopsys.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>